# Data preparation
Contents:
- Add names. (done)
- Remove outliers - people who have probably incorrectly answered questions and people who have randomly answered questions. (done - check errors)
- Data preparation for Together Apart - deciding and selecting columns that are relevant for friend-finding. (done)
- Final cleaning - analysing and solving any final issues; exporting the final dataframe as a separate csv. (done)

# Normalisation, Recommender System, and Data Viz
Contents:
- User recommendation - recommender system for matching users with 10 others based on the interests they register through the questionnaire registration form. (almost done - needs updating and a check)
- Data visualisation - view for user to aid understanding of their match percentage and a view for the developers to enable monitoring. (to do)

# - Add names
The Young People Survey dataset is anonymous. For our app, we will have the participants enter their name into the questionnaire. To make this dataset mimic the dataset we will create in our app, we need to give the participants names.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Read the data and store it as a pandas DataFrame
df = pd.read_csv("responses.csv")

In [None]:
df.describe()

In [None]:
# column labels
df.columns

In [None]:
# Shape of data
df.shape

We need to know which participants are male and which are female, so that we can add names of the correct gender.

In [None]:
# number of males, females, and unknown gender
n_male = (df['Gender'] == 'male').sum()
n_female = (df['Gender'] == 'female').sum()
n_unknown = df['Gender'].isnull().sum()
print(n_male, n_female, n_unknown)

There are 6 participants who did not give their gender. We'll give these people male names, as there are fewer males in the dataset.

We've created a list of male names and a list of female names which we can import and clean up as follows.

In [None]:
# Import name lists
df_female_names = pd.read_csv('female_names.csv')[:n_female]
df_male_names = pd.read_csv('male_names.csv')[:n_male + n_unknown]

# remove non-ascii characters which have occurred because the names were copied from a website.
df_male_names['Name'] = df_male_names['Name'].apply(lambda x: x.replace('\xa0', ' '))
df_female_names['Name'] = df_female_names['Name'].apply(lambda x: x.replace('\xa0', ' '))

In [None]:
# Add new column to main dataframe for name
df['Name'] = ''

In [None]:
# Set the indices of the male name dataframe to be the indices of the males (and unknowns) in the
# main dataframe.
df_male_names = df_male_names.set_index(df.index[(df['Gender'] == 'male') | df['Gender'].isnull()])

# Do the same for females.
df_female_names = df_female_names.set_index(df.index[(df['Gender'] == 'female')])

In [None]:
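# Add names into main dataframe. Selecting the 'Name' column assigns a Series,
# which pandas aligns on the indices set in the previous cell.
df['Name'] = df_male_names['Name']
df.loc[df['Gender'] == 'female', 'Name'] = df_female_names['Name']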

Now we have a column of names with the appropriate genders.

In [None]:
# Print the gender and names of the last few rows.
df[['Gender', 'Name']].tail()
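
The name assignment above relies on pandas index alignment. As a minimal sketch of the same idea on toy data, the names can also be assigned positionally through boolean masks. The toy dataframe and the name lists here are hypothetical and are not taken from `responses.csv` or the name files.

```python
import pandas as pd

# Toy data standing in for the survey responses (hypothetical values).
toy = pd.DataFrame({'Gender': ['female', 'male', None, 'female']})
toy['Name'] = ''

# Hypothetical name lists, in file order.
male_names = pd.Series(['Adam', 'Ben'])
female_names = pd.Series(['Cara', 'Dina'])

# Boolean masks for each group; a missing gender is grouped with the males.
is_female = toy['Gender'] == 'female'
is_male = ~is_female

# Converting to a NumPy array assigns by position, so no index alignment is needed.
toy.loc[is_male, 'Name'] = male_names.to_numpy()[:is_male.sum()]
toy.loc[is_female, 'Name'] = female_names.to_numpy()[:is_female.sum()]

print(toy)
```

Positional assignment only works when each name list is at least as long as its group, which is why the lists are sliced to the group sizes first.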

# - Remove outliers

Explore the data for abnormal values. Only the Age, Height, Weight, and Number of siblings questions allowed participants to enter any value, so let's check the extreme values of these columns to see if there are any anomalies.

In [None]:
df[['Age', 'Height', 'Weight', 'Number of siblings']].agg(['min', 'max'])

Age and number of siblings look ok. The height is measured in cm and the weight in kg, so the min height and max weight look like errors. Let's look more closely at these values.

In [None]:
df[df['Height'] < 120]

That 62cm height looks like an error, given that she weighs 55kg. Let's replace it with NaN.

In [None]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df.loc[676, 'Height'] = np.nan

In [None]:
df[df['Weight'] > 130]

Very few people report a weight of 150 kg or more. These participants may have misread the units and entered their weight in pounds (150 lb is about 68 kg). Let's replace these extreme values with NaN.

In [None]:
df.loc[885, 'Weight'] = np.nan
df.loc[992, 'Weight'] = np.nan
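
The replacements above use hard-coded row labels. As a hedged alternative, the same cleaning can be written with boolean masks so that it does not depend on row positions. The toy values and the two threshold constants below are assumptions chosen for illustration, not limits taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one implausible height and one implausible weight.
toy = pd.DataFrame({'Height': [62.0, 170.0, 185.0],
                    'Weight': [55.0, 150.0, 80.0]})

# Assumed plausibility limits for adults, in cm and kg.
HEIGHT_MIN_CM = 120
WEIGHT_MAX_KG = 150

# Replace values outside the plausible range with NaN.
toy.loc[toy['Height'] < HEIGHT_MIN_CM, 'Height'] = np.nan
toy.loc[toy['Weight'] >= WEIGHT_MAX_KG, 'Weight'] = np.nan

print(toy)
```

A mask-based rule still needs a manual look at the flagged rows first, as done above, because a threshold can also catch genuine values.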

Next, let's check whether there are any people who have incorrectly answered whether they are an only child and how many siblings they have. If someone is an only child, they have no siblings.

In [None]:
df[['Only child', 'Number of siblings']][(df['Only child'] == 'yes') & (df['Number of siblings'] >= 1)]

So there are 95 people who either incorrectly answered whether they are an only child or the number of siblings they have. Let's change these values to NaN for these participants.

In [None]:
# Indices of participants whose two answers contradict each other.
i_error = df[(df['Only child'] == 'yes') & (df['Number of siblings'] >= 1)].index
df.loc[i_error, ['Only child', 'Number of siblings']] = np.nan

Now let's identify people who are likely to have randomly answered the questions. For this, we can use the [local outlier factor algorithm](https://en.wikipedia.org/wiki/Local_outlier_factor). TO DO

In [None]:
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

In [None]:
df_float_only = df.dropna().loc[:, df.dtypes == float]
df_float_only_normalized = (df_float_only - df_float_only.min())/(df_float_only.max() - df_float_only.min())


# fit the model for outlier detection
clf = LocalOutlierFactor(n_neighbors=20)

clf.fit_predict(df_float_only_normalized)
X_scores = clf.negative_outlier_factor_

# # np.where(X_scores < -1.2)

# Take an explicit copy so that adding a column cannot raise a SettingWithCopyWarning.
df_dropna = df.dropna().copy()
df_dropna['LOF score'] = X_scores

In [None]:
df_dropna.shape

In [None]:
X_scores.shape

In [None]:
print(df_dropna[df_dropna['LOF score'] < -1.4].iloc[0].to_string())
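
To make the outlier step easier to follow, here is a small self-contained sketch of min-max normalisation followed by the local outlier factor on synthetic data. The data, the random seed, and the planted outlier are all assumptions for illustration; nothing here is drawn from the survey.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 200 synthetic "respondents" answering 5 questions close to the middle of the scale...
X = rng.normal(loc=3.0, scale=0.3, size=(200, 5))
# ...plus one respondent (row 200) far away from everyone else.
X = np.vstack([X, np.full((1, 5), 10.0)])

# Min-max normalise each column to the range [0, 1].
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Fit LOF; fit_predict labels outliers as -1 and inliers as 1.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_norm)

# More negative scores mean a point sits in a sparser region than its neighbours.
scores = lof.negative_outlier_factor_
most_outlying = int(np.argmin(scores))
print(most_outlying, scores[most_outlying])
```

Inliers score close to -1, so a cut-off such as the -1.4 used above is a judgment call that is worth checking against a few flagged rows.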

# - Data preparation for Together Apart
In this section, we decide on the columns that are relevant for friend-finding. We also get the data ready to match the Together Apart registration form/questionnaire.

## Viewing and analysing the data

In [None]:
# Reading the columns

print(df.columns.values)

In [None]:
# Checking the shape of the dataset

df.shape

In [None]:
# Checking the mean value, type, and length of some columns (overwritten)

df['Music'].mean()

## Deciding on the columns we'll use

For the purposes of this project we decided to continue with the following columns (please see the lists below).

In the registration form they will be grouped in two categories: "Activities" and "Interesting Subjects".

The form will provide a 1 to 5 Likert scale for each subject (each subject is a column name in this dataset), where 1 means "not interested" and 5 means "very interested". Users choose their level of interest and are then matched with someone based on similarities in what they would like to do with their new buddy.


#### Activities (Let's Do This - Together Apart) (I am looking for an activity buddy.)
* Dancing
* Singing ("Musical instruments" in this dataset)
* Writing
* Meditation ("Passive sport" in this dataset)
* Playing games ("Fun with friends" in this dataset)
* Active sports (such as yoga; "Active sport" in this dataset)
* Being creative ("Art exhibitions" in this dataset)
* Acting ("Theatre" in this dataset)
* Cooking ("Healthy eating" in this dataset)
* Gardening
* Pets

#### Interesting Subjects (Let's Talk - Together Apart) (I would like to talk about this with a buddy.)
* Music
* Movies
* Reading
* Foreign languages
* Daily events
* Celebrities
* Science and technology
* Future goals ("Thinking ahead" in this dataset)
* Sharing my past ("Changing the past" in this dataset)
* Dreams
* Loneliness
* Health
* Mental wellbeing ("Mood swings" in this dataset)
* Life struggles

### Renamed columns

In [None]:
# Implementing the decision above by altering the column names
# and saving those changes in a new dataframe: df_col_renamed.

df_col_renamed = df.rename(columns={'Musical instruments': 'Singing',
                           'Passive sport': 'Meditation',
                           'Fun with friends': 'Playing games',
                           'Active sport': 'Active sports',
                           'Art exhibitions': 'Being creative',
                           'Theatre': 'Acting',
                           'Healthy eating': 'Cooking',
                           'Thinking ahead': 'Future goals',
                           'Changing the past': 'Sharing my past',
                           'Mood swings': 'Mental wellbeing'})

print(df_col_renamed.columns.values)

### Final shortened dataframe

In [None]:
# Here I wanted to drop the unused columns and save the final version into a new file,
# but soon realised it would be much faster to just create a new df with the listed columns (ta - for together apart :D)

df_ta = df_col_renamed[['Name', 'Dancing', 'Singing', 'Writing', 'Meditation',
                        'Playing games', 'Active sports', 'Being creative',
                        'Acting', 'Cooking', 'Gardening', 'Pets', 'Music',
                        'Movies', 'Reading', 'Foreign languages', 'Daily events',
                        'Celebrities', 'Science and technology', 'Future goals',
                        'Sharing my past', 'Dreams', 'Loneliness', 'Health', 'Mental wellbeing', 'Life struggles']]

print(df_ta.columns.values)

# - Final cleaning

In [None]:
# Looking at the full final dataframe - uncomment to print

# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.width', None)
# pd.set_option('display.max_colwidth', None)

# print(df_ta)

From looking at this dataframe, we can notice some issues that need to be solved:

* "Dreams" column is of a different type
* The dataset contains missing values
* "Name" column has trailing whitespace

### Type casting

In [None]:
# Checking the types of final columns

print(df_ta.dtypes)

In [None]:
# Changing the Dreams column to be of float type.

df_ta = df_ta.astype({'Dreams': 'float64'})
print(df_ta.dtypes)

### Imputation of missing values


In [None]:
# Checking the number of NaNs

df_ta.isnull().sum().sum()

In [None]:
# Replacing NaNs in each column with that column's most frequent value:

df_ta = df_ta.apply(lambda x:x.fillna(x.value_counts().index[0]))

# Checking the number of NaNs after the change

df_ta.isnull().sum().sum()
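
As a minimal sketch of the same most-frequent-value imputation on toy data, `DataFrame.mode` can be used instead of `value_counts`. The toy columns and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values) with one missing answer per column.
toy = pd.DataFrame({'Music': [5.0, 5.0, np.nan, 3.0],
                    'Pets': [1.0, np.nan, 2.0, 2.0]})

# mode() returns the most frequent value(s) of each column, sorted;
# iloc[0] keeps the first row, i.e. one mode per column.
most_frequent = toy.mode().iloc[0]

# fillna with a Series fills each column with its own value.
filled = toy.fillna(most_frequent)

print(filled)
```

When two values are equally frequent, `mode()` returns them in sorted order, so this version always picks the smallest of the tied values.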

### Trailing whitespace removal

In [None]:
# Cleaning up the whitespace in the "Name" column

df_ta['Name'] = df_ta['Name'].apply(str.strip)

## Final Checks

In [None]:
# Looking at the full final dataframe

print(df_ta)

In [None]:
df_ta.describe()

In [None]:
# Checking the shape of the dataset

df_ta.shape

### All looks good!

In [None]:
# Replacing "df" with the final cleaned dataframe ("df_ta") for easier use in future

df = df_ta

In [None]:
# Exporting the final version of the dataframe as a .csv file
# (commented out so it won't save on another run automatically)

#df.to_csv('TA_PreData.csv')

### Notes

We have 25 subjects that will be used in the registration form and 1010 entries for each. We can already work from these to normalise the scores, implement a machine learning algorithm to match users, and create data visualisations for us and the user.

# - User recommendation
In this section, we use collaborative filtering to recommend users to other users based on the similarity of their interests.

In [None]:
df.describe()

Some columns are not numerical. We want to apply collaborative filtering to only the columns that are numerical and relevant to matching.

In [None]:
df_cf = df.iloc[:,:140].select_dtypes(include=['float64', 'int64'])

In [None]:
# Proportion of missing values among the numerical columns used for matching.
df_cf.isnull().sum().sum() / df_cf.size

Proportion of unanswered questions is small, so we should be able to use all the people for the recommender system.

In [None]:
#Centre each row
df_cf_centred = df_cf.sub(df_cf.mean(axis=1), axis=0)

In [None]:
df_cf_centred

In [None]:
# fill nans with 0.0 (mean of centred rows)
df_cf_centred = df_cf_centred.fillna(0.)
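
The centring step can be checked by hand on a tiny example. This is a sketch on hypothetical ratings for two users, "A" and "B", showing why filling NaNs with 0 after centring amounts to giving a user their own average for an unanswered question.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; user A skipped the "Pets" question.
ratings = pd.DataFrame({'Music': [5.0, 2.0],
                        'Movies': [4.0, 1.0],
                        'Pets': [np.nan, 3.0]},
                       index=['A', 'B'])

# Subtract each user's own mean rating (NaNs are ignored when taking the mean).
centred = ratings.sub(ratings.mean(axis=1), axis=0)

# After centring, 0 is each user's average, so it is a neutral fill value.
centred = centred.fillna(0.0)

print(centred)
```

User A's mean is 4.5, so the answers become 0.5, -0.5 and 0. User B's mean is 2, giving 0, -1 and 1.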

In [None]:
from sklearn.neighbors import NearestNeighbors

n_neighbors = 10

#make an object for the NearestNeighbors Class.
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=n_neighbors, n_jobs=-1)

# fit the dataset
model_knn.fit(df_cf_centred)

In [None]:
distances, indices = model_knn.kneighbors(df_cf_centred)
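
Because `kneighbors` is queried with the same data the model was fitted on, each row's closest neighbour is normally the row itself. The sketch below, on a hypothetical 4-user matrix of centred scores, asks for one extra neighbour and then drops the first column, so that `n_matches` real matches remain. The name `n_matches` is introduced here for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical centred scores for 4 users on 3 subjects.
X = np.array([[ 1.0, -1.0,  0.0],
              [ 1.0, -0.5, -0.5],
              [-1.0,  1.0,  0.0],
              [ 0.0,  1.0, -1.0]])

n_matches = 2

# Ask for one extra neighbour, since the nearest neighbour of each row is itself.
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=n_matches + 1)
knn.fit(X)
distances, indices = knn.kneighbors(X)

# Drop column 0 (the user themselves) to keep only the real matches.
match_indices = indices[:, 1:]
match_distances = distances[:, 1:]

print(match_indices[0], match_distances[0])
```

With `n_neighbors` set to 10 and the first column skipped, each user is left with 9 matches rather than 10.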

In [None]:
indices

In [None]:
distances

In [None]:
distances[:,1:].max()

In [None]:
np.where(distances == distances[:,1].min())

In [None]:
indices[449,:]

In [None]:
df_cf_centred.iloc[[449,453],:]

In [None]:
# Aim: write a function where you enter your name and it lists the N best matches, their match score,
# plots the match scores, plots the interest trees.

def find_matches(index=None, name=None, n=10):
    if name is None and index is None:
        raise ValueError('Enter an index or name.')
    if index is None:
        # Use the first row whose name matches.
        index = list(df[df['Name'] == name].index)[0]

    # Column 0 is the user themselves, so skip it and keep at most n matches
    # (with n_neighbors=10 only 9 matches are available).
    match_indices = indices[index, 1:n + 1]
    match_names = list(df.iloc[match_indices]['Name'])
    match_distances = list(distances[index, 1:n + 1])

    return match_names, match_distances


In [None]:
name = 'Sally Abraham'

In [None]:
list(df[df['Name'] == name].index)[0]

In [None]:
df.iloc[[1,2,3,40]]['Name']

In [None]:
find_matches(name=name)

In [None]:
find_matches(index=0)

In [None]:
import matplotlib.pyplot as plt

In [None]:
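match_names, match_distances = find_matches(index=0)
x = range(len(match_names))
# Convert the list of distances to an array so that arithmetic works element-wise.
plt.bar(x, (1 - np.array(match_distances)) * 100)
plt.xticks(x, match_names, rotation='vertical')
plt.ylim([0, 100])
plt.xlabel('Top matches')
plt.ylabel('Match percentage')
plt.show()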
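
Cosine distance ranges from 0 to 2, so `1 - distance` can fall below 0 for users with opposite interests. Below is a hedged sketch of two ways to turn a distance into a match percentage that always stays between 0 and 100. The function name `match_percentage` and its `rescale` flag are hypothetical and are not part of the notebook.

```python
import numpy as np

def match_percentage(distances, rescale=False):
    """Convert cosine distances (0 to 2) into percentages between 0 and 100."""
    d = np.asarray(distances, dtype=float)
    if rescale:
        # Map the full distance range [0, 2] linearly onto [100, 0].
        return (1.0 - d / 2.0) * 100.0
    # Treat any negative similarity as a 0% match.
    return np.clip(1.0 - d, 0.0, 1.0) * 100.0

print(match_percentage([0.0, 0.5, 1.5]))
print(match_percentage([0.0, 0.5, 1.5], rescale=True))
```

Clipping keeps the numbers identical to the plot above whenever the distance is below 1, while rescaling changes every value, so the choice depends on how the percentage is explained to users.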