# Activity: Preparing the Speed Dating Dataset

As an entrepreneur, you are planning to launch a new dating app. The key feature that will differentiate you from competitors will be your high-performing algorithm for matching users. Before starting to build this model, you partnered with a speed dating company to collect data from real events. You have just received the dataset from your partner but realised it is not as clean as expected, so your task is to fix the main data quality issues you find.
The following steps will help you complete this activity:
- Download and load the dataset into Python
- Check for duplicated rows
- Check for unexpected values in numerical variables
- Check for incorrect data types
- Check for missing values
- Fix the identified issues if needed
The original dataset has been shared by Ray Fisman and Sheena Iyengar from Columbia Business School: 

http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/

The authors have provided a very useful document describing the dataset and its features: 

http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/Speed%20Dating%20Data%20Key.doc



1. Open a new Colab notebook and import the pandas package:

In [0]:
import pandas as pd

2. Assign the link to the dataset to a variable called 'file_url':

In [0]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter11/dataset/Speed_Dating_Data.csv'

3. Using the read_csv() function from the pandas package, load the dataset into a new variable called 'df':

In [0]:
df = pd.read_csv(file_url)

4. Print the first 5 rows of the dataframe using the method .head():

In [0]:
df.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,mn_sat,tuition,race,...,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,1,11.0,0,0.14,0,27.0,2.0,35.0,20.0,20.0,20.0,0.0,5.0,0,6.0,8.0,8.0,8.0,8.0,6.0,7.0,4.0,2.0,21.0,Law,1.0,,,,4.0,...,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,2,12.0,0,0.54,0,22.0,2.0,60.0,0.0,0.0,40.0,0.0,0.0,0,7.0,8.0,10.0,7.0,7.0,5.0,8.0,4.0,2.0,21.0,Law,1.0,,,,4.0,...,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,3,13.0,1,0.16,1,22.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,21.0,Law,1.0,,,,4.0,...,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,4,14.0,1,0.61,0,23.0,2.0,30.0,5.0,15.0,40.0,5.0,5.0,1,7.0,8.0,9.0,8.0,9.0,8.0,7.0,7.0,2.0,21.0,Law,1.0,,,,4.0,...,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,5,15.0,1,0.21,0,24.0,3.0,30.0,10.0,20.0,10.0,10.0,20.0,1,8.0,7.0,9.0,6.0,9.0,7.0,8.0,6.0,2.0,21.0,Law,1.0,,,,4.0,...,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,


5. Print out the shape of the dataframe (number of rows and columns) using the pandas attribute .shape:

In [0]:
df.shape

(8378, 195)

This dataset contains quite a lot of features (195) for 8378 rows. Let's check whether it contains any duplicated rows.

6. Print out the number of duplicated rows (looking at all columns from the dataframe) by combining the pandas methods .duplicated() and .sum():

In [0]:
df.duplicated().sum()

0

Looking at all 195 columns of this dataset, there are no duplicated rows at all. Let's run an extra check, this time looking only at the identifier variables listed in the dataset description document.

7. Print out the number of duplicated rows as in step 6, but this time look only at the identifier columns ('iid', 'id', 'partner' and 'pid') by specifying the parameter 'subset':

In [0]:
df.duplicated(subset=['iid','id','partner','pid']).sum()


0

It seems there are no duplicated rows in this dataset.
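To see why the two checks above can give different answers, here is a minimal sketch on a made-up dataframe (the values and the reduced set of columns are assumptions for illustration, not taken from the dataset). It also uses the 'keep=False' option of .duplicated(), which flags every row of a duplicated group and can help when inspecting them:

```python
import pandas as pd

# Toy data (made up for illustration): rows 0 and 2 share the same identifiers
# but differ in the 'match' column.
toy = pd.DataFrame({
    'iid': [1, 1, 1, 2],
    'pid': [11, 12, 11, 11],
    'match': [0, 1, 1, 0],
})

# No row is duplicated when every column is compared...
n_full_duplicates = toy.duplicated().sum()

# ...but one row repeats an (iid, pid) pair that already appeared.
n_id_duplicates = toy.duplicated(subset=['iid', 'pid']).sum()

# keep=False flags all members of a duplicated group, not only the repeats.
duplicate_rows = toy[toy.duplicated(subset=['iid', 'pid'], keep=False)]

print(n_full_duplicates, n_id_duplicates)
print(duplicate_rows)
```

On the real dataset both counts were 0, so no further inspection was needed.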

Looking at the dataset description document, we know the values of the following variables should range between 1 and 10: 'imprace', 'imprelig', 'sports', 'tvsports', 'exercise', 'dining',
'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 
'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga',
'exphappy', 'satis_2'. In the next few steps, we are going to check that there are no unexpected values in these columns.

8. Create a variable called 'scale_1_10' which will list the following columns names: 'imprace', 'imprelig', 'sports', 'tvsports', 'exercise', 'dining',
'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 
'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga',
'exphappy', 'satis_2'.

In [0]:
scale_1_10 = ['imprace', 'imprelig', 'sports', 'tvsports', 'exercise', 'dining',
'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 
'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga',
'exphappy', 'satis_2']

10. Create a function called 'check_range' that takes as input parameters a column (a pandas Series), a minimum value and a maximum value. The function will check, for each row of the given column, whether the value is outside the given range (below the minimum value or above the maximum value) and return the corresponding Series of Boolean values:

In [0]:
def check_range(column, min_value, max_value):
  return (column < min_value) | (column > max_value)

11. Test your function on the column 'imprace' with 1 and 10 as the minimum and maximum values respectively, save its output into a variable called 'unexpected_mask' and print its sum to check how many values fall outside this range:

In [0]:
unexpected_mask = check_range(df['imprace'], 1, 10)
unexpected_mask.sum()

8

So there are 8 rows that have values for 'imprace' outside the expected range (between 1 and 10).

12. Define a function called 'print_unexpected' that takes as input parameters a dataframe, a column name and a Boolean mask. If the sum of the mask is greater than 0, the function prints the column name, this sum, and the unique values of the given column for the rows where the mask is True, using the pandas indexer '.loc' and the method '.unique()':

In [0]:
def print_unexpected(df, col_name, unexpected_mask):
  if unexpected_mask.sum() > 0:
    print(col_name)
    print(unexpected_mask.sum())
    print(df.loc[unexpected_mask,col_name].unique())

13. Test your function on the column 'imprace' with the output of the previous function, unexpected_mask:

In [0]:
print_unexpected(df, 'imprace', unexpected_mask)

imprace
8
[0.]


We can see we still have 8 cases that are outside the expected range for this column and the unexpected value is 0.
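One detail worth keeping in mind is how 'check_range' treats missing values. The sketch below applies the same logic to a made-up Series (the values are assumptions for illustration): comparisons against NaN evaluate to False in pandas, so missing values are not counted as out of range and have to be checked separately.

```python
import numpy as np
import pandas as pd

def check_range(column, min_value, max_value):
    # Same logic as the function defined in step 10.
    return (column < min_value) | (column > max_value)

# Toy ratings (made up): one value below the range, one above, one missing.
ratings = pd.Series([0.0, 5.0, 14.0, np.nan, 10.0])

out_of_range = check_range(ratings, 1, 10)

# NaN < 1 and NaN > 10 are both False, so the missing value is not flagged.
n_out_of_range = out_of_range.sum()
n_missing = ratings.isna().sum()

print(n_out_of_range, n_missing)
```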

14. Create a function called 'check_ranges' that takes as input parameters a dataframe, a list of columns, a minimum value and a maximum value. This function will iterate through each column of the given list, call the function 'check_range' and pass its output to the function 'print_unexpected' (defined in steps 10 and 12 respectively):

In [0]:
def check_ranges(df, col_list, min_value, max_value):
  for col_name in col_list:
    unexpected_mask = check_range(df[col_name], min_value, max_value)
    print_unexpected(df, col_name, unexpected_mask)

15. Test this function with the dataset, the list 'scale_1_10' you defined in step 8, and 1 and 10 as the minimum and maximum values respectively:

In [0]:
check_ranges(df, scale_1_10, 1, 10)

imprace
8
[0.]
museums
18
[0.]
art
18
[0.]
hiking
18
[0.]
gaming
137
[14.  0.]
clubbing
18
[0.]
reading
51
[13.]
theater
18
[0.]
movies
18
[0.]
concerts
18
[0.]
yoga
36
[0.]


We can see that most of these columns contain the unexpected value 0 and some of them contain 13 and 14. In a real project, you would go and ask the surveyors whether these values are expected. Let's say they confirmed that 0 is actually a possible value in the survey but 13 and 14 are not: they think these are recording errors and the values should be 10. Let's see how we can fix these issues in the next steps.

16. Create a function called 'replace_value' that takes as input parameters a dataframe, a column name, an incorrect value and a new value. This function will select all the rows where the given column equals the incorrect value, replace that value with the new one, and print out the column name and the list of unique values of this column after replacement:

In [0]:
def replace_value(df, col_name, incorrect_value, new_value):
  df.loc[df[col_name] == incorrect_value, col_name] = new_value
  print(col_name)
  print(df[col_name].unique())

17. Test your function on the 'gaming' column, 14 as the incorrect value and 10 as the new value:

In [0]:
replace_value(df, 'gaming', 14, 10)

gaming
[ 1.  5.  4.  6.  2.  3.  7.  8. 10. nan  9.  0.]


We see that after replacement the value 14 is no longer one of the values of this column.

18. Use your function on the column 'reading', 13 as the incorrect value and 10 as the new value:

In [0]:
replace_value(df, 'reading', 13, 10)

reading
[ 6. 10.  7.  9.  8.  4.  5. nan  2.  3.  1.]


We see that after replacement the value 13 is no longer one of the values of this column.
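As an aside, the same kind of correction can also be written with the pandas method .replace(). This is only an alternative sketch on made-up data, not the approach used in this activity; the column name is borrowed from the dataset but the values are assumptions:

```python
import numpy as np
import pandas as pd

# Toy column (made up): 14 plays the role of the recording error.
toy = pd.DataFrame({'gaming': [1.0, 14.0, 0.0, np.nan, 14.0]})

# Map the incorrect value to the agreed one and assign the result back.
toy['gaming'] = toy['gaming'].replace({14: 10})

# Count how many incorrect values remain after the replacement.
remaining = (toy['gaming'] == 14).sum()

print(toy['gaming'].unique())
print(remaining)
```

Other values, including 0 and missing values, are left untouched.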

19. Create a for loop that will iterate through the following suffixes: ['1_1', '1_2', '1_3', '1_s', '2_1', '2_2', '2_3', '4_1', '4_2', '4_3', '7_2', '7_3']. For each of them, use a list comprehension (or another for loop) with the method .endswith() to extract the columns whose names end with the given suffix, store them in a variable called 'suffix_cols', and then apply the function 'check_ranges' to this list with 0 and 100 as the minimum and maximum values:

In [0]:
for suffix in ['1_1', '1_2', '1_3', '1_s', '2_1', '2_2', '2_3', '4_1', '4_2', '4_3', '7_2', '7_3']:
  suffix_cols = [col for col in df.columns if col.endswith(suffix)]
  check_ranges(df, suffix_cols, 0, 100)

No output is displayed, which means all these columns have values within the expected range of 0 to 100.
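An empty output is only reassuring if each suffix actually matched some columns, because a suffix that matches nothing also prints nothing. A minimal sketch of such a sanity check, using hypothetical column names that mimic the dataset's naming pattern and a deliberately non-existent suffix '9_9':

```python
import pandas as pd

# Hypothetical column names (assumed for illustration only).
toy = pd.DataFrame(columns=['attr1_1', 'sinc1_1', 'attr2_1', 'age'])

matched = {}
for suffix in ['1_1', '2_1', '9_9']:
    # Same selection logic as in step 19.
    suffix_cols = [col for col in toy.columns if col.endswith(suffix)]
    matched[suffix] = suffix_cols
    # Printing the count shows whether the suffix selected anything at all.
    print(suffix, len(suffix_cols))
```

Here '9_9' selects no columns, so a range check on it would silently do nothing.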

20. Create a similar for loop as step 19 for the following suffixes and with 1 and 10 as minimum and maximum values: ['3_1', '3_2', '3_3', '5_1', '5_2', '5_3', '3_s']

In [0]:
for suffix in ['3_1', '3_2', '3_3', '5_1', '5_2', '5_3', '3_s']:
  suffix_cols = [col for col in df.columns if col.endswith(suffix)]
  check_ranges(df, suffix_cols, 1, 10)

attr3_3
112
[12.]
sinc3_3
173
[12.]
intel3_3
233
[12.]
fun3_3
153
[12.]
amb3_3
147
[12.]


We can see that all columns ending with '3_3' have 12 as an unexpected value. Let's say that, after consulting the surveyors, we agreed to replace these values with 10.

21. Create a for loop that iterates through the list of columns ending with '3_3' and calls the function 'replace_value' for each of them, with 12 as the incorrect value and 10 as the new value:

In [0]:
for col_name in ['attr3_3', 'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3']:
  replace_value(df, col_name, 12, 10)

attr3_3
[ 5.  7. nan  6.  4.  9.  8.  3. 10.  2.]
sinc3_3
[ 7.  6. nan  5.  8.  9. 10.  4.  3.  2.]
intel3_3
[ 7.  9. nan  6. 10.  8.  5.  4.  3.]
fun3_3
[ 7.  9. nan  8.  6.  3.  5. 10.  2.  4.]
amb3_3
[ 7.  4. nan  5. 10.  9.  8.  6.  2.  3.  1.]


Great! We have fixed the unexpected values for these columns.

22. Print the data type of each variable using the attribute dtypes:

In [0]:
df.dtypes

iid           int64
id          float64
gender        int64
idg           int64
condtn        int64
             ...   
attr5_3     float64
sinc5_3     float64
intel5_3    float64
fun5_3      float64
amb5_3      float64
Length: 195, dtype: object

We can see that most of the columns have been detected as numerical variables, but the dataset description document tells us that most of them are actually categorical. Let's change their data type.

23. Create a list called 'num_cols' containing the following list of columns: 'round', 'order', 'int_corr', 'age', 'mn_sat', 'income', 'expnum'

In [0]:
num_cols = ['round', 'order', 'int_corr', 'age', 'mn_sat', 'income', 'expnum']

24. Create another list called 'cat_cols' containing the remaining columns names (excluding the ones in num_cols) of this dataframe using the attribute columns combined with the method '.difference()':

In [0]:
cat_cols = df.columns.difference(num_cols)

25. Create a for loop that will iterate through cat_cols and change the data type for each of them into a category using the method '.astype()':

In [0]:
for col_name in cat_cols:
  df[col_name] = df[col_name].astype('category')

26. Print the data type of each variable using the attribute dtypes:

In [0]:
df.dtypes

iid         category
id          category
gender      category
idg         category
condtn      category
              ...   
attr5_3     category
sinc5_3     category
intel5_3    category
fun5_3      category
amb5_3      category
Length: 195, dtype: object

Great! We have sorted out the data type for each column. Now let's see if we have missing values in the numerical columns.
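Before moving on, here is a small sketch of what the conversion in step 25 does to a single column. The Series below is made up for illustration (the name 'field_cd' is borrowed from the dataset, the values are assumptions). It shows that the distinct values become the categories and that missing values stay missing after the conversion:

```python
import numpy as np
import pandas as pd

# Toy column (made up): numeric codes standing in for a categorical variable.
codes = pd.Series([1.0, 2.0, 2.0, np.nan], name='field_cd')

as_category = codes.astype('category')

# The distinct non-missing values become the categories.
categories = as_category.cat.categories.tolist()

# Missing values are preserved; they are not turned into a category.
n_missing = as_category.isna().sum()

print(as_category.dtype)
print(categories, n_missing)
```

This means the categorical columns may still contain missing values even though the next steps only deal with the numerical ones.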

27. Print the number of missing values for each column in num_cols by combining the methods .isna() and .sum():

In [0]:
df[num_cols].isna().sum()

round          0
order          0
int_corr     158
age           95
mn_sat      5245
income      4099
expnum      6578
dtype: int64

There are some missing values for most of these columns. We need to fix these cases. Let's start with the column 'int_corr':

28. Print the unique values of the variable 'int_corr' using the method .unique():

In [0]:
df['int_corr'].unique()

array([ 0.14,  0.54,  0.16,  0.61,  0.21,  0.25,  0.34,  0.5 ,  0.28,
       -0.36,  0.29,  0.18,  0.1 , -0.21,  0.32,  0.73,  0.6 ,  0.07,
        0.11,  0.39, -0.24, -0.14,  0.09, -0.04, -0.3 , -0.26, -0.15,
       -0.47, -0.18,  0.05,  0.37,  0.35,  0.15, -0.19, -0.43,  0.  ,
       -0.17,  0.08, -0.16,  0.06, -0.05, -0.13, -0.06,  0.33, -0.51,
        0.12,  0.19,  0.47,  0.03,  0.46,  0.43,  0.52, -0.46, -0.27,
        0.59,  0.31, -0.34, -0.03, -0.11,  0.42, -0.4 , -0.23,  0.17,
        0.68, -0.01, -0.35,  0.3 ,  0.65,  0.24,  0.41,  0.49,  0.01,
        0.22, -0.08,  0.27,  0.44,  0.62, -0.2 , -0.02, -0.33, -0.52,
       -0.1 ,  0.58, -0.57, -0.31, -0.07, -0.32,  0.04, -0.12,  0.48,
       -0.22, -0.29,  0.38,  0.53, -0.38,  0.02, -0.28,  0.13,  0.2 ,
         nan, -0.41, -0.44,  0.51, -0.48,  0.4 ,  0.26,  0.77, -0.49,
       -0.25, -0.09,  0.45, -0.39,  0.83,  0.57, -0.61,  0.72, -0.37,
        0.23, -0.58,  0.8 , -0.56,  0.63, -0.63,  0.71,  0.36,  0.56,
        0.55,  0.76,

The values of the column 'int_corr' range between -1 and 1. It seems they have been normalised. As there are no extreme values or outliers, we can impute the missing values with the mean of this variable. This is what we are going to do in the next few steps.

29. Create a condition mask called int_corr_mask for finding the missing values in the column 'int_corr' using the method .isna():

In [0]:
int_corr_mask = df['int_corr'].isna()

30. Display the number of missing values for this column using the method .sum() on 'int_corr_mask':

In [0]:
int_corr_mask.sum()

158

We got the exact same number of missing values for 'int_corr' as in step 27.

31. Extract the mean of 'int_corr' using the method '.mean()' and store it in a new variable called int_corr_mean. Print its value:

In [0]:
int_corr_mean = df['int_corr'].mean()
print(int_corr_mean)

0.19600973236009664


The average value for this column is 0.196. We will replace all missing values by this value in the column 'int_corr'.

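32. Replace all missing values of the variable 'int_corr' with its average using the method '.fillna()' and assign the result back to the column (calling '.fillna()' with 'inplace=True' on a selected column is not guaranteed to update the dataframe in recent versions of pandas):

In [0]:
df['int_corr'] = df['int_corr'].fillna(int_corr_mean)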

33. Print the number of missing values for 'int_corr' by combining the methods .isna() and .sum():

In [0]:
df['int_corr'].isna().sum()

0

Perfect! There are no missing values left in this variable.
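The mean imputation carried out in steps 29 to 33 can be sketched on a made-up column (the values are assumptions chosen so the arithmetic is easy to follow). The sketch also shows a property of this technique: filling missing values with the mean leaves the mean of the column unchanged.

```python
import numpy as np
import pandas as pd

# Toy column (made up): three observed values and two missing ones.
toy = pd.DataFrame({'int_corr': [0.2, np.nan, -0.4, 0.5, np.nan]})

# .mean() skips missing values by default: (0.2 - 0.4 + 0.5) / 3 = 0.1
toy_mean = toy['int_corr'].mean()

# Fill the gaps and assign the result back to the column.
toy['int_corr'] = toy['int_corr'].fillna(toy_mean)

n_missing_after = toy['int_corr'].isna().sum()
mean_after = toy['int_corr'].mean()

print(round(toy_mean, 3), n_missing_after, round(mean_after, 3))
```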

34. Create a new variable called 'missing_num_cols' containing the following columns: 'age', 'mn_sat', 'income', 'expnum'

In [0]:
missing_num_cols = ['age', 'mn_sat', 'income', 'expnum']

35. Create a for loop that will iterate through the columns in 'missing_num_cols' and print their name and their list of unique values using the method '.unique()':

In [0]:
for col_name in missing_num_cols:
  print(col_name)
  print(df[col_name].unique())

age
[21. 24. 25. 23. 22. 26. 27. 30. 28. nan 29. 34. 35. 32. 39. 20. 19. 18.
 37. 33. 36. 31. 42. 38. 55.]
mn_sat
[  nan 1070. 1258. 1400. 1290. 1460. 1430. 1215. 1330. 1450. 1155. 1140.
 1360. 1402. 1250. 1210. 1220. 1410. 1260. 1380. 1030. 1309. 1308. 1050.
 1100. 1310. 1490. 1188. 1097. 1212. 1340. 1034. 1185. 1242. 1160. 1099.
 1214. 1270. 1110. 1178. 1060. 1157. 1180. 1014. 1341.  990. 1320. 1159.
 1370. 1105. 1365. 1011. 1130. 1206. 1331. 1191.  914. 1200. 1080. 1090.
 1092. 1470. 1149. 1134. 1230. 1267. 1280. 1227. 1239.]
income
[ 69487.  65929.     nan  37754.  86340.  60304.  54620.  48652.  29237.
  56580.  36782.  38548.  52010.  28418.  43185.  23152.  43664.  48441.
  61152.  36485.  41507.  17134.  30038.  33772.  24997.  42096.  28891.
  62635.  12063.  29809.  26482.  30147.  39919.  41466.  23988.  28989.
  50948.  38022.  47559.  53539.  32159.  53940.  40753.  38207.  46166.
  30973.  28317.  26645.  25589.  55223. 109031.  40409.  21597.  76624.
  35968.  51725.  55

The values for these columns are not normalised and some of them have outliers so this time we are going to use their median to fill in the missing values.
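To illustrate why the median is the safer choice when outliers are present, here is a minimal sketch on made-up income values (assumptions for illustration, not taken from the dataset). A single very large value pulls the mean far above the typical values, while the median stays close to them:

```python
import numpy as np
import pandas as pd

# Toy incomes (made up): four typical values, one outlier, one missing value.
income = pd.Series([30000.0, 35000.0, 40000.0, 45000.0, 500000.0, np.nan])

# Mean of the five observed values: 650000 / 5 = 130000
income_mean = income.mean()

# Median of the five observed values: the middle one, 40000
income_median = income.median()

# Imputing with the median gives the missing row a typical value.
filled = income.fillna(income_median)

print(income_mean, income_median)
print(filled.tolist())
```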

36. Create a for loop similar to step 35, but this time calculate the median of each column and save it into a variable called 'col_median', impute the missing values with this median using the method '.fillna()' (assigning the result back to the column), and print the name of the column and its median value:

In [0]:
for col_name in missing_num_cols:
  col_median = df[col_name].median()
  df[col_name] = df[col_name].fillna(col_median)
  print(col_name)
  print(col_median)

age
26.0
mn_sat
1310.0
income
43185.0
expnum
4.0


37. Create a for loop similar to step 35 but this time you will print the name of each column and their number of missing values using the combination of the methods '.isna()' and '.sum()':

In [0]:
for col_name in missing_num_cols:
  print(col_name)
  print(df[col_name].isna().sum())

age
0
mn_sat
0
income
0
expnum
0


Excellent! In this activity we have fixed most of the main quality issues in this dataset. We looked for duplicated rows, incorrect values, wrong data types and missing values. You have also put into practice all the techniques we learned in this chapter to fix these issues. If this were a real use case, we would now be more confident in using this cleaned version of the dataset for the next step of the project.