# Data analysis to match like-minded friends
We are using an open-access dataset acquired from a questionnaire completed by 1010 young Slovakians. The dataset is called the Young People Survey and is available on Kaggle [here](https://www.kaggle.com/miroslavsabo/young-people-survey).

Possible projects:
- Clustering - find natural groupings of people.
- Recommender system - recommend people to each other based on how many interests they share.
- Predicting questionnaire responses based on other questions. This could be useful if someone only partially fills in the questionnaire.
- Visualisation - What data do we need to visualise and how can we do that best?
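
As a rough illustration of the recommender idea in the list above, the sketch below counts shared interests between every pair of people. The names, the ratings, and the rule that a rating of 4 or 5 counts as "interested" are all assumptions made for this example; they are not taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy ratings on the survey's 1-5 scale; names and values are made up.
ratings = pd.DataFrame(
    {
        "Music": [5, 5, 2, 4],
        "Dance": [4, 1, 5, 4],
        "Folk": [1, 4, 1, 2],
        "Rock": [5, 5, 1, 4],
    },
    index=["Ann", "Ben", "Cat", "Dan"],
)

# Assumption: a rating of 4 or 5 means the person is interested in the topic.
likes = (ratings >= 4).astype(int)

# Entry (i, j) counts the topics that person i and person j both like.
shared = likes @ likes.T

# Everyone trivially shares all interests with themselves, so blank the diagonal.
values = shared.to_numpy().astype(float)
np.fill_diagonal(values, np.nan)
shared = pd.DataFrame(values, index=ratings.index, columns=ratings.index)

# For each person, pick the other person with the most shared interests.
best_match = shared.idxmax(axis=1)
print(best_match)
```

Ties go to whoever appears first in the table, so a real recommender would probably need a tie-breaking rule or a finer-grained similarity measure.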

## Data preparation
Contents:
- Adding names (done)
- Selecting columns that are relevant for friend-finding. (to do)
- Removing outliers: people who have probably answered questions incorrectly or at random. (to do)

### Adding names
The Young People Survey dataset is anonymous. For our app, we will have the participants enter their name into the questionnaire. To make this dataset mimic the dataset we will create in our app, we need to give the participants names.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read the data and store it as a pandas DataFrame
df = pd.read_csv("responses.csv")

In [3]:
df.describe()

Unnamed: 0,Music,Slow songs or fast songs,Dance,Folk,Country,Classical music,Musical,Pop,Rock,Metal or Hardrock,...,Shopping centres,Branded clothing,Entertainment spending,Spending on looks,Spending on gadgets,Spending on healthy eating,Age,Height,Weight,Number of siblings
count,1007.0,1008.0,1006.0,1005.0,1005.0,1003.0,1008.0,1007.0,1004.0,1007.0,...,1008.0,1008.0,1007.0,1007.0,1010.0,1008.0,1003.0,990.0,990.0,1004.0
mean,4.731877,3.328373,3.11332,2.288557,2.123383,2.956132,2.761905,3.471698,3.761952,2.36147,...,3.234127,3.050595,3.201589,3.106256,2.870297,3.55754,20.433699,173.514141,66.405051,1.297809
std,0.664049,0.833931,1.170568,1.138916,1.076136,1.25257,1.260845,1.1614,1.184861,1.372995,...,1.323062,1.306321,1.188947,1.205368,1.28497,1.09375,2.82884,10.024505,13.839561,1.013348
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,15.0,62.0,41.0,0.0
25%,5.0,3.0,2.0,1.0,1.0,2.0,2.0,3.0,3.0,1.0,...,2.0,2.0,2.0,2.0,2.0,3.0,19.0,167.0,55.0,1.0
50%,5.0,3.0,3.0,2.0,2.0,3.0,3.0,4.0,4.0,2.0,...,3.0,3.0,3.0,3.0,3.0,4.0,20.0,173.0,64.0,1.0
75%,5.0,4.0,4.0,3.0,3.0,4.0,4.0,4.0,5.0,3.0,...,4.0,4.0,4.0,4.0,4.0,4.0,22.0,180.0,75.0,2.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,30.0,203.0,165.0,10.0


In [4]:
# column labels
df.columns

Index(['Music', 'Slow songs or fast songs', 'Dance', 'Folk', 'Country',
       'Classical music', 'Musical', 'Pop', 'Rock', 'Metal or Hardrock',
       ...
       'Age', 'Height', 'Weight', 'Number of siblings', 'Gender',
       'Left - right handed', 'Education', 'Only child', 'Village - town',
       'House - block of flats'],
      dtype='object', length=150)

In [5]:
# Shape of data
df.shape

(1010, 150)

We need to know which participants are male and which are female, so that we can add names of the correct gender.

In [6]:
# number of males, females, and unknown gender
n_male = (df['Gender'] == 'male').sum()
n_female = (df['Gender'] == 'female').sum()
n_unknown = df['Gender'].isnull().sum()
print(n_male, n_female, n_unknown)

411 593 6
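
The same tally can be obtained in a single call. The sketch below uses a small made-up Series in place of `df['Gender']`; `value_counts(dropna=False)` keeps missing values as their own category.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['Gender']; the real column comes from responses.csv.
gender = pd.Series(["male", "female", "female", np.nan, "male", "female"])

# dropna=False keeps missing values as their own category in the tally.
counts = gender.value_counts(dropna=False)
print(counts)

# The number of missing answers, computed as in the cell above.
n_unknown = int(gender.isna().sum())
```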


Six participants did not give their gender. We'll give these people male names, as there are fewer males in the dataset.

We've created a list of male names and a list of female names which we can import and clean up as follows.

In [7]:
# Import name lists
df_female_names = pd.read_csv('female_names.csv')[:n_female]
df_male_names = pd.read_csv('male_names.csv')[:n_male + n_unknown]

# Replace non-breaking spaces (\xa0), which crept in because the names were copied from a website.
df_male_names['Name'] = df_male_names['Name'].apply(lambda x: x.replace('\xa0', ' '))
df_female_names['Name'] = df_female_names['Name'].apply(lambda x: x.replace('\xa0', ' '))

In [8]:
# Add new column to main dataframe for name
df['Name'] = ''

In [9]:
# Set the indices of the male name dataframe to be the indices of the males (and unknowns) in the
# main dataframe.
df_male_names = df_male_names.set_index(df.index[(df['Gender'] == 'male') | df['Gender'].isnull()])

# Do the same for females.
df_female_names = df_female_names.set_index(df.index[(df['Gender'] == 'female')])

In [10]:
# Add names into main dataframe.
# Select the 'Name' column explicitly so that the assignment aligns on the index.
df['Name'] = df_male_names['Name']
df.loc[df['Gender'] == 'female', 'Name'] = df_female_names['Name']

Now we have a column of names with the appropriate genders.
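
An alternative to re-indexing the name lists is to assign by position through boolean masks. The following is only a sketch on made-up data: the small `df`, `male_names`, and `female_names` objects are hypothetical stand-ins for the real survey and name files, and it assumes every non-female row (including missing gender) should receive a male name.

```python
import pandas as pd

# Toy stand-ins for the survey and the two name lists (all values made up).
df = pd.DataFrame({"Gender": ["male", "female", None, "female"]})
male_names = pd.Series(["Roy Martin", "Ciaran May", "Spare Name"])
female_names = pd.Series(["Caroline Wilks", "Lelia Williams"])

is_female = df["Gender"] == "female"
# Assumption: everyone else, including missing gender, receives a male name.
is_male_or_unknown = ~is_female

# Fail early if either list is too short to cover its group.
assert len(female_names) >= is_female.sum()
assert len(male_names) >= is_male_or_unknown.sum()

df["Name"] = ""
# .to_numpy() drops the lists' own index, so names are assigned by position.
df.loc[is_female, "Name"] = female_names.iloc[: is_female.sum()].to_numpy()
df.loc[is_male_or_unknown, "Name"] = (
    male_names.iloc[: is_male_or_unknown.sum()].to_numpy()
)
print(df)
```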

In [11]:
# Print the gender and names of the last few rows.
df[['Gender', 'Name']].tail()

Unnamed: 0,Gender,Name
1005,female,Caroline Wilks
1006,male,Roy Martin
1007,female,Lelia Williams
1008,female,Lauren Williamson
1009,male,Ciaran May


### Remove outliers

First, we explore the data for abnormal values. Only the Age, Height, Weight, and Number of siblings questions allowed participants to enter any value, so let's check the extremes of these columns for anomalies.

In [12]:
df[['Age', 'Height', 'Weight', 'Number of siblings']].agg(['min', 'max'])

Unnamed: 0,Age,Height,Weight,Number of siblings
min,15.0,62.0,41.0,0.0
max,30.0,203.0,165.0,10.0


Age and number of siblings look OK. Height is measured in cm and weight in kg, so the minimum height (62 cm) and maximum weight (165 kg) look like errors. Let's look more closely at these values.

In [13]:
df[df['Height'] < 120]

Unnamed: 0,Music,Slow songs or fast songs,Dance,Folk,Country,Classical music,Musical,Pop,Rock,Metal or Hardrock,...,Height,Weight,Number of siblings,Gender,Left - right handed,Education,Only child,Village - town,House - block of flats,Name
676,5.0,4.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,...,62.0,55.0,2.0,female,right handed,college/bachelor degree,no,city,house/bungalow,Sarah Baker


That 62cm height looks like an error, given that she weighs 55kg. Let's replace it with NaN.

In [14]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0.
df.loc[676, 'Height'] = np.nan
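
The "height does not fit the weight" reasoning can be made explicit with the body mass index. This is only a sketch: the BMI bounds of 10 and 60 are illustrative assumptions rather than anything defined by the survey, and the two rows are toy values.

```python
import pandas as pd

# Two toy rows: a typical respondent and the 62 cm / 55 kg case discussed above.
people = pd.DataFrame({"Height": [173.0, 62.0], "Weight": [64.0, 55.0]})

# Body mass index: weight in kg divided by squared height in metres.
bmi = people["Weight"] / (people["Height"] / 100) ** 2

# Assumption: a BMI outside 10-60 signals a likely data-entry error.
# These bounds are illustrative, not taken from the survey.
implausible = (bmi < 10) | (bmi > 60)
print(pd.DataFrame({"BMI": bmi.round(1), "implausible": implausible}))
```

A check like this flags a row without saying which of the two values is wrong, so flagged rows would still need a manual look.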

In [15]:
df[df['Weight'] > 130]

Unnamed: 0,Music,Slow songs or fast songs,Dance,Folk,Country,Classical music,Musical,Pop,Rock,Metal or Hardrock,...,Height,Weight,Number of siblings,Gender,Left - right handed,Education,Only child,Village - town,House - block of flats,Name
885,3.0,4.0,3.0,2.0,2.0,2.0,3.0,4.0,4.0,4.0,...,,165.0,0.0,female,right handed,secondary school,yes,city,house/bungalow,Keava Mone
992,4.0,4.0,4.0,1.0,4.0,4.0,1.0,3.0,4.0,4.0,...,200.0,150.0,1.0,male,right handed,masters degree,no,city,block of flats,Stacey Lewis


Only two participants report a weight of 150 kg or more. They may have misread the units and entered their weight in pounds instead of kilograms. Let's replace these extreme values with NaN.

In [22]:
df.loc[885, 'Weight'] = np.nan
df.loc[992, 'Weight'] = np.nan
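
To see why a pounds/kilograms mix-up is a plausible explanation, the short calculation below converts the two reported values as if they had been entered in pounds. The conversion itself is just arithmetic; it does not prove that this is what happened.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound

for reported in (150, 165):
    print(f"{reported} lb is about {reported * LB_TO_KG:.1f} kg")
```

Both converted values (roughly 68 kg and 75 kg) fall between the median (64 kg) and the 75th percentile (75 kg) reported by `df.describe()` above.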

Next, let's check whether anyone gave contradictory answers about being an only child and their number of siblings. An only child has, by definition, no siblings.

In [34]:
df[['Only child', 'Number of siblings']][(df['Only child'] == 'yes') & (df['Number of siblings'] >= 1)]

Unnamed: 0,Only child,Number of siblings
3,yes,1.0
25,yes,1.0
28,yes,2.0
47,yes,1.0
48,yes,1.0
65,yes,1.0
70,yes,1.0
71,yes,1.0
73,yes,1.0
131,yes,2.0


So there are 95 people who either incorrectly answered whether they are an only child or the number of siblings they have. Let's change these values to NaN for these participants.

In [36]:
# Rows where the two answers contradict each other.
contradictory = (df['Only child'] == 'yes') & (df['Number of siblings'] >= 1)
i_error = df.index[contradictory]
df.loc[i_error, ['Only child', 'Number of siblings']] = np.nan
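
A cross-tabulation shows every combination of the two answers at once, which makes contradictions in either direction easy to spot. The sketch below uses made-up answers containing one "only child with a sibling" and one "not an only child, but zero siblings".

```python
import pandas as pd

# Toy answers (made up) containing one contradiction of each kind.
answers = pd.DataFrame(
    {
        "Only child": ["yes", "no", "no", "yes", "no"],
        "Number of siblings": [0.0, 1.0, 2.0, 1.0, 0.0],
    }
)

# Rows are 'Only child' answers, columns are sibling counts, cells are head counts.
table = pd.crosstab(answers["Only child"], answers["Number of siblings"])
print(table)

# 'yes' together with one or more siblings.
n_yes_with_siblings = int(table.loc["yes", table.columns >= 1].sum())
# 'no' together with zero siblings.
n_no_without_siblings = int(table.loc["no", 0.0])
```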

Now let's identify people who are likely to have answered the questions at random. For this, we can use the [local outlier factor algorithm](https://en.wikipedia.org/wiki/Local_outlier_factor). (to do)
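
To give a feel for what the LOF score measures, here is a toy example on synthetic data, assuming scikit-learn is installed. The 100 clustered points and the single planted "odd" respondent are invented for illustration and have nothing to do with the survey answers.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 100 synthetic "respondents" whose normalized answers cluster around 0.5 ...
inliers = rng.normal(loc=0.5, scale=0.05, size=(100, 5))
# ... plus one respondent with alternating extreme answers.
odd_one = np.array([[0.0, 1.0, 0.0, 1.0, 0.0]])
X = np.vstack([inliers, odd_one])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 marks predicted outliers, 1 marks inliers
# Scores near -1 are typical; scores far below -1 are unusual.
scores = lof.negative_outlier_factor_

print("most unusual row:", scores.argmin(), "score:", scores.min())
```

The planted row sits far from its 20 nearest neighbours compared with how close those neighbours are to each other, which is what pushes its score well below -1.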

In [16]:
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

In [24]:
# Keep complete rows only, then select the numeric (float) columns.
df_dropna = df.dropna().copy()
df_float_only = df_dropna.loc[:, df_dropna.dtypes == float]

# Min-max scale every column to the range [0, 1].
df_float_only_normalized = (df_float_only - df_float_only.min()) / (df_float_only.max() - df_float_only.min())

# Fit the model for outlier detection.
clf = LocalOutlierFactor(n_neighbors=20)
clf.fit_predict(df_float_only_normalized)
X_scores = clf.negative_outlier_factor_

# Attach the scores to an explicit copy, which avoids pandas' SettingWithCopyWarning.
df_dropna['LOF score'] = X_scores


In [18]:
df_dropna.shape

(674, 152)

In [19]:
X_scores.shape

(674,)
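
Dropping every row with a missing value leaves 674 of the 1010 participants, so about a third of them never receive a score. One possible alternative, shown here only as a sketch on made-up data and not as the method used in this notebook, is to fill the gaps with each column's median before scaling.

```python
import numpy as np
import pandas as pd

# Toy numeric answers with gaps; values are made up.
answers = pd.DataFrame(
    {
        "Music": [5.0, 4.0, np.nan, 5.0],
        "Dance": [3.0, np.nan, 2.0, 4.0],
        "Folk": [1.0, 2.0, 2.0, np.nan],
    }
)

# dropna() discards every respondent with at least one gap.
print("rows kept by dropna:", len(answers.dropna()))

# Alternative (an assumption, not the notebook's method): fill each gap with
# the column median so that every respondent keeps a full row.
filled = answers.fillna(answers.median())

# Min-max scaling as in the cell above, guarding against constant columns.
value_range = (filled.max() - filled.min()).replace(0, 1)
normalized = (filled - filled.min()) / value_range
print(normalized)
```

Imputation makes incomplete respondents look more typical than they may be, so whether this trade-off is acceptable depends on how the scores are used.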

In [20]:
print(df_dropna[df_dropna['LOF score'] < -1.4].iloc[0].to_string())

Music                                                   5
Slow songs or fast songs                                5
Dance                                                   1
Folk                                                    1
Country                                                 1
Classical music                                         5
Musical                                                 5
Pop                                                     1
Rock                                                    1
Metal or Hardrock                                       1
Punk                                                    1
Hiphop, Rap                                             1
Reggae, Ska                                             3
Swing, Jazz                                             4
Rock n roll                                             1
Alternative                                             5
Latino                                                  5
Techno, Trance



In [32]:
# The column stores lowercase 'yes'/'no', so compare against 'no'
# (a comparison against 'No' matches no rows).
df[['Only child', 'Number of siblings']][(df['Only child'] == 'no') & (df['Number of siblings'] == 0)]
