
# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a *significant amount of time* is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up *80% or more of an analyst’s time*.



In [2]:
import pandas as pd
import numpy as np
from numpy import nan as NA # represent NaN as NA

## A. Handling Missing Data
Missing data occurs commonly in many data analysis applications. One of the goals
of pandas is to make working with missing data as painless as possible. For example,
all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect,
but it is functional for a lot of users. For numeric data, pandas uses the floating-point
value NaN (Not a Number) to represent missing data.

The built-in Python **None** value is also treated as NA in object arrays:

In [None]:
string_data = pd.Series(['Kolkata', 'Delhi', np.nan, 'Bangalore'])
string_data

0      Kolkata
1        Delhi
2          NaN
3    Bangalore
dtype: object

In [None]:
# Flag which values are missing (None / NaN)
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [None]:
# The built-in Python None value is also treated as NA in object arrays: None, NA, nan
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
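As a quick check of the earlier statement that descriptive statistics exclude missing data by default, here is a minimal sketch. The Series `s` is a made-up example, not data from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                    # NaN is skipped: 1.0 + 3.0
mean_skip = s.mean()               # mean over the two present values
mean_keep = s.mean(skipna=False)   # NaN propagates when skipna=False

print(total, mean_skip, mean_keep)
```

Passing `skipna=False` is the way to make a missing value show up in the result instead of being silently ignored.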

### NA handling methods
- `dropna`: Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
- `fillna`: Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
- `isnull`: Return boolean values indicating which values are missing/NA.
- `notnull`: Negation of `isnull`.
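The four methods listed above can be tried side by side on a small Series. This is only an illustrative sketch; the values in `s` are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries
s = pd.Series([1.0, np.nan, 3.0, np.nan])

missing_mask = s.isnull()    # True where a value is missing
present_mask = s.notnull()   # element-wise negation of isnull
dropped = s.dropna()         # keeps only the non-missing entries
filled = s.fillna(0.0)       # replaces each NaN with a constant

print(dropped)
print(filled)
```

None of these calls changes `s`; each returns a new object.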

### Filtering Out Missing Data
While you can always filter by hand using `pandas.isnull` and boolean indexing, `dropna` is often more convenient.

In [None]:
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [None]:
# Drop the missing values
data.dropna()  # returns a new Series; data itself is unchanged

0    1.0
2    3.5
4    7.0
dtype: float64

In [None]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [None]:
# This is equivalent to:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA, or only those containing any NA values.
By default, `dropna` drops **any row containing a missing value**:

In [None]:
 
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [None]:
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [None]:
# Count the missing values in each column
data.isnull().sum()

0    2
1    2
2    2
dtype: int64

In [None]:
# Passing how='all' will only drop rows that are all NA:
data.dropna(how='all')

In [None]:
# To drop columns in the same way, pass axis=1. First add an all-NA column:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [None]:
# Drop columns in which every value is missing
# axis=0 refers to rows, axis=1 to columns
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
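The "varying thresholds" mentioned in the method list correspond to the `thresh` argument of `dropna`, which keeps a row only if it has at least that many non-missing values. A minimal sketch, using a small frame `frame` built here for illustration:

```python
import numpy as np
import pandas as pd

# Rows 0..3 contain 3, 1, 0 and 2 non-missing values respectively
frame = pd.DataFrame([[1.0, 6.5, 3.0],
                      [1.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.0]])

# Keep only rows with at least 2 non-missing values
kept = frame.dropna(thresh=2)
print(kept)
```

This sits between the default `how='any'` (strict) and `how='all'` (lenient).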


### Filling In Missing Data
For most purposes, the fillna method is the workhorse function to use. Calling fillna with a **constant** replaces **missing values** with that value:

In [None]:
# df created
df = pd.DataFrame(np.random.randn(7, 3), columns=['gold', 'silver', 'copper'])
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,gold,silver,copper
0,0.960932,,
1,1.455036,,
2,1.334481,,-1.208838
3,-1.001076,,2.365561
4,-0.360475,0.525528,1.530065
5,0.241174,0.083004,-0.299816
6,-1.156792,-1.296698,1.883608


In [None]:
# Fill The missing values with Zero
df.fillna(0)

Unnamed: 0,gold,silver,copper
0,0.960932,0.0,0.0
1,1.455036,0.0,0.0
2,1.334481,0.0,-1.208838
3,-1.001076,0.0,2.365561
4,-0.360475,0.525528,1.530065
5,0.241174,0.083004,-0.299816
6,-1.156792,-1.296698,1.883608


In [None]:
# Calling fillna with a dict, you can use a different fill value for each column:
df.fillna({'silver': -1, 'copper': 1})

Unnamed: 0,gold,silver,copper
0,0.960932,-1.0,1.0
1,1.455036,-1.0,1.0
2,1.334481,-1.0,-1.208838
3,-1.001076,-1.0,2.365561
4,-0.360475,0.525528,1.530065
5,0.241174,0.083004,-0.299816
6,-1.156792,-1.296698,1.883608


In [None]:
# fillna returns a new object; without reassignment, df itself is unchanged:
df.fillna(0)
df


Unnamed: 0,gold,silver,copper
0,0.960932,,
1,1.455036,,
2,1.334481,,-1.208838
3,-1.001076,,2.365561
4,-0.360475,0.525528,1.530065
5,0.241174,0.083004,-0.299816
6,-1.156792,-1.296698,1.883608


In [None]:
df.fillna(0, inplace=True)  # modifies df in place and returns None
df

Unnamed: 0,gold,silver,copper
0,0.960932,0.0,0.0
1,1.455036,0.0,0.0
2,1.334481,0.0,-1.208838
3,-1.001076,0.0,2.365561
4,-0.360475,0.525528,1.530065
5,0.241174,0.083004,-0.299816
6,-1.156792,-1.296698,1.883608


The same **interpolation** methods available for reindexing can be used with fillna:

In [None]:
# Creating Dataframe
df = pd.DataFrame(np.random.randn(8, 3), columns=['gold', 'silver', 'copper'])
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,gold,silver,copper
0,0.159362,0.527933,1.037528
1,-0.596954,1.031973,-0.395796
2,0.501706,,0.790893
3,0.379001,,-0.332592
4,-1.497654,,
5,0.633566,,
6,0.945829,,
7,-1.339841,,


In [None]:
# Forward-fill missing values (equivalent to the older fillna(method='ffill'))
df.ffill()

Unnamed: 0,gold,silver,copper
0,0.611627,-0.148583,-0.680851
1,-0.784319,-0.31051,-0.314903
2,1.517101,-0.31051,-0.349676
3,0.987027,-0.31051,0.007926
4,-3.077174,-0.31051,0.007926
5,0.726522,-0.31051,0.007926
6,0.106931,-0.31051,0.007926
7,0.993971,-0.31051,0.007926


In [None]:
# Fill at most 3 consecutive missing values in each column
df.ffill(limit=3)

Unnamed: 0,gold,silver,copper
0,0.611627,-0.148583,-0.680851
1,-0.784319,-0.31051,-0.314903
2,1.517101,-0.31051,-0.349676
3,0.987027,-0.31051,0.007926
4,-3.077174,-0.31051,0.007926
5,0.726522,,0.007926
6,0.106931,,0.007926
7,0.993971,,
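To make the effect of the fill direction and of `limit` easier to see, here is a small sketch on a hypothetical Series `s`. It uses the `ffill` and `bfill` methods directly; since pandas 2.1 the `method=` argument of `fillna` is deprecated in their favour.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                  # carry the last valid value forward
forward_limited = s.ffill(limit=1)   # fill at most one consecutive NaN
backward = s.bfill()                 # pull the next valid value backward

print(forward_limited)
```

With `limit=1`, only the first NaN of the run is filled and the other two stay missing.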


In [None]:
# Fill with a per-column statistic (the maximum here; the mean or median work the same way)
df.fillna(df.max())

Unnamed: 0,gold,silver,copper
0,0.159362,0.527933,1.037528
1,-0.596954,1.031973,-0.395796
2,0.501706,1.031973,0.790893
3,0.379001,1.031973,-0.332592
4,-1.497654,1.031973,1.037528
5,0.633566,1.031973,1.037528
6,0.945829,1.031973,1.037528
7,-1.339841,1.031973,1.037528


In [None]:
df.fillna(df.mean(),inplace=True)
df

Unnamed: 0,gold,silver,copper
0,0.611627,-0.148583,-0.680851
1,-0.784319,-0.31051,-0.314903
2,1.517101,-0.229546,-0.349676
3,0.987027,-0.229546,0.007926
4,-3.077174,-0.229546,-0.334376
5,0.726522,-0.229546,-0.334376
6,0.106931,-0.229546,-0.334376
7,0.993971,-0.229546,-0.334376


In [None]:
df['category'] = [NA,'A','B',NA,'A','C','A','B']
df

Unnamed: 0,gold,silver,copper,category
0,0.159362,0.527933,1.037528,
1,-0.596954,1.031973,-0.395796,A
2,0.501706,,0.790893,B
3,0.379001,,-0.332592,
4,-1.497654,,,A
5,0.633566,,,C
6,0.945829,,,A
7,-1.339841,,,B


In [None]:
df['category'].mode()[0]  # mode() can return several values; take the first

'A'

In [None]:
# Fill with the mode (most frequent value); assign back rather than using inplace on a column selection
df['category'] = df['category'].fillna(df['category'].mode()[0])
df

Unnamed: 0,gold,silver,copper,category
0,0.159362,0.527933,1.037528,A
1,-0.596954,1.031973,-0.395796,A
2,0.501706,,0.790893,B
3,0.379001,,-0.332592,A
4,-1.497654,,,A
5,0.633566,,,C
6,0.945829,,,A
7,-1.339841,,,B
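The numeric and categorical fills shown above can also be combined in a single `fillna` call by passing a dict of per-column fill values. The frame `toy` and its column names are invented for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    'value': [1.0, np.nan, 3.0, 100.0],
    'label': ['A', None, 'A', 'B'],
})

# One fill value per column: median for numbers, mode for labels
fill_values = {
    'value': toy['value'].median(),
    'label': toy['label'].mode()[0],
}
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```

The median is used for `value` here because a single large entry (100.0) would pull the mean far from the typical value.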


## B. Data Transformation

### Removing Duplicates
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [None]:
data = pd.DataFrame({'city': ['kolkata', 'delhi'] * 3 + ['delhi'],'count': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,city,count
0,kolkata,1
1,delhi,1
2,kolkata,2
3,delhi,3
4,kolkata,3
5,delhi,4
6,delhi,4


The DataFrame method `duplicated()` returns a **boolean Series** indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [None]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

`drop_duplicates()` returns a DataFrame containing only the rows for which `duplicated()` is False:

In [None]:
data.drop_duplicates()

Unnamed: 0,city,count
0,kolkata,1
1,delhi,1
2,kolkata,2
3,delhi,3
4,kolkata,3
5,delhi,4


In [None]:
data['price'] = np.random.randint(10,100,size=7)
data

Unnamed: 0,city,count,price
0,kolkata,1,57
1,delhi,1,44
2,kolkata,2,91
3,delhi,3,23
4,kolkata,3,39
5,delhi,4,61
6,delhi,4,78


In [None]:
# Drop rows that duplicate an earlier value in the 'count' column
data.drop_duplicates(['count'])

Unnamed: 0,city,count,price
0,kolkata,1,57
2,kolkata,2,91
3,delhi,3,23
5,delhi,4,61
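By default `drop_duplicates` keeps the first occurrence of each duplicate group. The sketch below, on a small made-up frame `toy`, shows how the `subset` and `keep` arguments change which row survives.

```python
import pandas as pd

# Hypothetical frame: rows 1 and 2 agree on city and count but not on price
toy = pd.DataFrame({'city': ['kolkata', 'delhi', 'delhi'],
                    'count': [1, 4, 4],
                    'price': [57, 61, 78]})

first_kept = toy.drop_duplicates(subset=['city', 'count'])               # keeps row 1
last_kept = toy.drop_duplicates(subset=['city', 'count'], keep='last')   # keeps row 2

print(first_kept)
print(last_kept)
```

Which row to keep matters whenever the non-key columns (here `price`) differ between the duplicates.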


### Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. 

In [None]:
data = pd.DataFrame({'city':['New York','Delhi','Kolkata','Chicago','Las Vegas'], 
                     'pupulation': np.random.randint(100000,1000000000,size=5)
                     })
data

Unnamed: 0,city,pupulation
0,New York,731581758
1,Delhi,855141249
2,Kolkata,39045168
3,Chicago,795039675
4,Las Vegas,487601410


In [None]:
city_to_country = {'new york':'usa','delhi':'india','kolkata':'india','chicago':'usa','las vegas':'usa'}
city_to_country

{'chicago': 'usa',
 'delhi': 'india',
 'kolkata': 'india',
 'las vegas': 'usa',
 'new york': 'usa'}

In [None]:
# Check the data types
data.dtypes

city          object
pupulation     int64
dtype: object

In [None]:
data['city'] = data['city'].str.lower()
data

Unnamed: 0,city,pupulation
0,new york,731581758
1,delhi,855141249
2,kolkata,39045168
3,chicago,795039675
4,las vegas,487601410


In [None]:
data['country'] = data['city'].map(city_to_country)
data

Unnamed: 0,city,pupulation,country
0,new york,731581758,usa
1,delhi,855141249,india
2,kolkata,39045168,india
3,chicago,795039675,usa
4,las vegas,487601410,usa


In [None]:
# Country calling codes; named dial_codes to avoid shadowing the built-in zip()
dial_codes = {'usa': '+1', 'india': '+91', 'uk': '+44'}
data['zip'] = data['country'].map(dial_codes)
data

Unnamed: 0,city,pupulation,country,zip
0,new york,731581758,usa,1
1,delhi,855141249,india,91
2,kolkata,39045168,india,91
3,chicago,795039675,usa,1
4,las vegas,487601410,usa,1
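One detail of `map` worth knowing: a value with no matching key in the dict becomes NaN rather than raising an error. A minimal sketch, with a hypothetical Series `cities` and a deliberately incomplete mapping:

```python
import pandas as pd

cities = pd.Series(['delhi', 'chicago', 'paris'])
city_to_country = {'delhi': 'india', 'chicago': 'usa'}  # 'paris' is deliberately absent

countries = cities.map(city_to_country)          # the unmapped value becomes NaN
countries_filled = countries.fillna('unknown')   # make the gap explicit

print(countries_filled)
```

This is also why the city names were lower-cased before mapping: `'Delhi'` would not match the key `'delhi'` and would silently turn into NaN.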


### Replacing Values
Filling in missing data with the `fillna()` method is a special case of more general value replacement. As you’ve already seen, `map()` can be used to modify a subset of values in an object, but `replace()` provides a simpler and more flexible way to do so.

In [None]:
# Replace specific values with NaN; values not present in the column are left unchanged
data['pupulation'] = data['pupulation'].replace([843448647, 127963973], np.nan)
data

Unnamed: 0,city,pupulation,country,zip
0,new york,731581758,usa,1
1,delhi,855141249,india,91
2,kolkata,39045168,india,91
3,chicago,795039675,usa,1
4,las vegas,487601410,usa,1
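`replace` accepts a single value, a list of values, or a dict that maps each old value to its own replacement. The sketch below uses a hypothetical Series `s` in which -999 and -1000 stand for sentinel codes.

```python
import numpy as np
import pandas as pd

# Hypothetical Series where -999 and -1000 are sentinel codes
s = pd.Series([1.0, -999.0, 2.0, -1000.0, 3.0])

one_value = s.replace(-999, np.nan)               # a single sentinel
many_to_one = s.replace([-999, -1000], np.nan)    # several sentinels, one target
per_value = s.replace({-999: np.nan, -1000: 0})   # dict: one target per sentinel

print(per_value)
```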


### Detecting and Filtering Outliers
Filtering or transforming **outliers** is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [None]:
data = pd.DataFrame(np.random.randn(1000, 4),columns=['Aaba','Baba','Caca','Dada'])

# Summary statistics give a first hint of outliers (compare min/max with the quartiles)
data.describe()

Unnamed: 0,Aaba,Baba,Caca,Dada
count,1000.0,1000.0,1000.0,1000.0
mean,-0.014128,-0.02883,-0.009874,0.004128
std,0.981378,0.962308,1.001781,1.027538
min,-2.964888,-3.355816,-3.54911,-3.167404
25%,-0.710339,-0.686079,-0.679527,-0.654529
50%,-0.012924,-0.024915,-0.062036,0.036778
75%,0.66758,0.629801,0.670944,0.72562
max,3.23814,2.930522,3.342285,2.978472


In [None]:
data[np.abs(data['Caca']) > 3]

Unnamed: 0,Aaba,Baba,Caca,Dada
157,-1.459662,0.067723,-3.034796,1.111442
260,1.419877,0.72515,3.342285,-0.088954
645,1.001509,-0.440654,-3.54911,-0.18853


In [None]:
# Detect rows that have an outlier in any column

data[(np.abs(data) > 3).any(axis=1)]  # axis=1 reduces across the columns of each row

Unnamed: 0,Aaba,Baba,Caca,Dada
39,1.455963,-3.116615,-0.11166,-0.775272
46,1.100756,-0.036092,1.373986,-3.167404
157,-1.459662,0.067723,-3.034796,1.111442
260,1.419877,0.72515,3.342285,-0.088954
444,1.261487,1.118862,0.3156,-3.092298
504,-0.687179,-3.115978,1.511751,0.653478
518,-1.375066,-3.110294,0.935354,0.956307
645,1.001509,-0.440654,-3.54911,-0.18853
688,3.23814,0.435656,-0.571016,-1.438396
828,0.224491,-3.355816,0.763214,-0.383391


In [None]:
# Rows in which every column is an outlier
data[(np.abs(data) > 3).all(axis=1)]

Unnamed: 0,Aaba,Baba,Caca,Dada


In [None]:
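new_row = {'Aaba': 4, 'Baba': 4, 'Caca': -4, 'Dada': -4}
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([data, pd.DataFrame([new_row])], ignore_index=True)
data[(np.abs(data) > 3).all(axis=1)]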

Unnamed: 0,Aaba,Baba,Caca,Dada
1000,4.0,4.0,-4.0,-4.0
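Once outliers are detected, one common way to transform them instead of dropping the rows is to cap values at a fixed bound. The sketch below does this with `DataFrame.clip`; the frame, the seed, the planted outliers and the bound of 3 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
frame = pd.DataFrame(rng.standard_normal((1000, 4)), columns=['a', 'b', 'c', 'd'])

# Plant two known outliers so the effect is visible
frame.loc[0, 'a'] = 5.0
frame.loc[1, 'b'] = -5.0

# Cap every value to the interval [-3, 3]
capped = frame.clip(lower=-3, upper=3)
print(capped.describe())
```

Capping keeps the row count unchanged, which matters when the other columns of an outlier row still carry useful information.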


# Project: Risk of being drawn into online sex work

### Context
This database was used in the paper: Covert online ethnography and machine learning for detecting individuals at risk of being drawn into online sex work. 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Barcelona, Spain, 28-31 August.

### Content
The database includes data scraped from a European online adult forum. Using covert online ethnography we interviewed a small number of participants and determined their risk to either supply or demand sex services through that forum. This is a great dataset for semi-supervised learning.

### Inspiration
How can we identify individuals at risk of being drawn into online sex work? The spread of online social media enables a greater number of people to be involved in the online sex trade; however, detecting deviant behaviors online is limited by the low availability of data. To overcome this challenge, we combine covert online ethnography with semi-supervised learning using data from a popular European adult forum.

## Importing Data

In [3]:
import pandas as pd
import numpy as np

import warnings; warnings.filterwarnings('ignore')

In [4]:
# Load the uncleaned data into a DataFrame named df
df = pd.read_csv('/content/online_sex_work.csv', index_col=0)
# Keep only the first 28,831 rows
df = df.iloc[:28831, :]

df.head()  # first five rows

Unnamed: 0_level_0,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
10386.0,male,346,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0.0,0.0,0.0,18260,No_risk
14.0,male,322,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9.0,0.0,0.0,11778320244376823969273184588431277,No_risk
16721.0,male,336,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1.0,1.0,45.0,198052172119802,No_risk
16957.0,male,34,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1.0,0.0,1.0,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125.0,male,395,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0.0,6.0,8.0,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


In [5]:
df.tail()  # last five rows

Unnamed: 0_level_0,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
9962.0,male,272,J,Non_Verified,Heterosexual,Switch,Women,0,before_23_days,5.9.2012,0,0:3,1.0,0.0,0.0,,unknown_risk
9964.0,male,464,J,Non_Verified,bicurious,Submisive,Men_and_Women,15,before_597_days,5.9.2012,0,0:0,0.0,0.0,0.0,,unknown_risk
9966.0,male,288,C,Non_Verified,Heterosexual,Submisive,Women,15,before_4_days,5.9.2012,0,0:0,0.0,0.0,0.0,,unknown_risk
9968.0,male,315,J,Non_Verified,bisexual,Submisive,Men,30,before_665_days,1.9.2012,0,2:54,4.0,0.0,0.0,,unknown_risk
998.0,female,387,F,Non_Verified,Heterosexual,Dominant,Nobody,20,before_157_days,20.5.2010,0,0:9,0.0,0.0,1.0,,unknown_risk


In [6]:
# Understand the Data Types
df.dtypes

Gender                                  object
Age                                     object
Location                                object
Verification                            object
Sexual_orientation                      object
Sexual_polarity                         object
Looking_for                             object
Points_Rank                             object
Last_login                              object
Member_since                            object
Number_of_Comments_in_public_forum      object
Time_spent_chating_H:M                  object
Number_of_advertisments_posted         float64
Number_of_offline_meetings_attended    float64
Profile_pictures                       float64
Friends_ID_list                         object
Risk                                    object
dtype: object

## Data Cleaning


### Change datatype for some features

Data in a number of features that contain numerical data could be converted into pure numbers (integers), which would take less memory and could be interpreted more easily by machine learning models.

In [7]:
df.index = df.index.astype(int) # change the type to int
df['Number_of_advertisments_posted'] = df['Number_of_advertisments_posted'].astype(int)
df['Number_of_offline_meetings_attended'] = df['Number_of_offline_meetings_attended'].astype(int)
df['Profile_pictures'] = df['Profile_pictures'].astype(int)
df['Friends_ID_list'] = df['Friends_ID_list'].astype(str)
df['Risk'] = df['Risk'].astype(str)

df.head()

Unnamed: 0_level_0,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
10386,male,346,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,male,322,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,male,336,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,male,34,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,male,395,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


In [8]:
df.dtypes

Gender                                 object
Age                                    object
Location                               object
Verification                           object
Sexual_orientation                     object
Sexual_polarity                        object
Looking_for                            object
Points_Rank                            object
Last_login                             object
Member_since                           object
Number_of_Comments_in_public_forum     object
Time_spent_chating_H:M                 object
Number_of_advertisments_posted          int64
Number_of_offline_meetings_attended     int64
Profile_pictures                        int64
Friends_ID_list                        object
Risk                                   object
dtype: object

In [11]:
# Check the error: a direct cast raises a ValueError if the strings contain spaces
df['Number_of_Comments_in_public_forum'] = df['Number_of_Comments_in_public_forum'].astype(int)

In [10]:
# Strip the spaces with str.replace, then cast to int
df['Number_of_Comments_in_public_forum'] = df['Number_of_Comments_in_public_forum'].str.replace(' ', '').astype(int)


In [15]:
df.dtypes

Gender                                 object
Age                                    object
Location                               object
Verification                           object
Sexual_orientation                     object
Sexual_polarity                        object
Looking_for                            object
Points_Rank                            object
Last_login                             object
Member_since                           object
Number_of_Comments_in_public_forum      int64
Time_spent_chating_H:M                 object
Number_of_advertisments_posted          int64
Number_of_offline_meetings_attended     int64
Profile_pictures                        int64
Friends_ID_list                        object
Risk                                   object
dtype: object
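The space-stripping conversion used for `Number_of_Comments_in_public_forum` can be tried in isolation. The values in `raw` below are hypothetical stand-ins for the kind of strings that make a direct `astype(int)` fail.

```python
import pandas as pd

# Hypothetical raw values; '1 710' cannot be parsed by int() as it stands
raw = pd.Series(['32', '1 710', ' 25 '])

# regex=False treats ' ' as a literal character rather than a pattern
cleaned = raw.str.replace(' ', '', regex=False).astype(int)
print(cleaned)
```

Passing `regex=False` makes the intent explicit and avoids surprises if the character being removed has a special meaning in regular expressions.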

### Counting the Missing Values

In [16]:
# Count of missing values column wise
df.isnull().sum()

Gender                                   4
Age                                      0
Location                                 1
Verification                             0
Sexual_orientation                       1
Sexual_polarity                          1
Looking_for                            425
Points_Rank                              0
Last_login                               0
Member_since                             0
Number_of_Comments_in_public_forum       0
Time_spent_chating_H:M                   0
Number_of_advertisments_posted           0
Number_of_offline_meetings_attended      0
Profile_pictures                         0
Friends_ID_list                          0
Risk                                     0
dtype: int64

### Convert `Gender` to binary data

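Before converting `Gender`, we fill the missing values in `Looking_for` using simple conditions: for example, if a user is homosexual and male, we fill `Looking_for` with `Men`. The `fill_gender_na` function below encodes these rules (despite its name, it returns a value for `Looking_for`, not for `Gender`). Any remaining gaps in `Looking_for` and `Gender` are filled with the most frequent value, and `Gender` is then reduced to a boolean column recording whether the entry is `female`.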

In [19]:
def fill_gender_na(row):
    if row['Sexual_orientation'] == 'Homosexual':
        if row['Gender'] == 'male':
            return 'Men'
        elif row['Gender'] == 'female':
            return 'Women'
    elif row['Sexual_orientation'] == 'Heterosexual':
        if row['Gender'] == 'female':
            return 'Men'
        elif row['Gender'] == 'male':
            return 'Women'
    return np.nan

In [20]:
# Compute a candidate value per row, then fill only the missing entries
fill_values = df.apply(fill_gender_na, axis=1)
df['Looking_for'] = df['Looking_for'].fillna(fill_values)
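Passing a Series to `fillna`, as done above, fills each missing entry with the value that shares its index label and leaves existing values alone. A minimal sketch with a made-up frame `toy` and a hypothetical `candidates` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical frame indexed by user id
toy = pd.DataFrame({
    'Gender': ['male', 'female', 'male'],
    'Looking_for': ['Women', np.nan, np.nan],
}, index=[10, 20, 30])

# Candidate fill values, one per row label (NaN means "no suggestion")
candidates = pd.Series(['Men', 'Men', np.nan], index=[10, 20, 30])

toy['Looking_for'] = toy['Looking_for'].fillna(candidates)
print(toy)
```

Row 10 keeps its original `Women`, row 20 receives `Men`, and row 30 stays missing because no candidate was available, which is why a second fill with the mode is still needed afterwards.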

In [23]:
# Lets check the missing values
df.isnull().sum()

Gender                                 4
Age                                    0
Location                               1
Verification                           0
Sexual_orientation                     1
Sexual_polarity                        1
Looking_for                            0
Points_Rank                            0
Last_login                             0
Member_since                           0
Number_of_Comments_in_public_forum     0
Time_spent_chating_H:M                 0
Number_of_advertisments_posted         0
Number_of_offline_meetings_attended    0
Profile_pictures                       0
Friends_ID_list                        0
Risk                                   0
dtype: int64

In [22]:
# Fill the remaining missing values with the most frequent value (mode)
df['Looking_for'] = df['Looking_for'].fillna(df['Looking_for'].mode()[0])
df.head()

Unnamed: 0_level_0,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
10386,male,346,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,male,322,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,male,336,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,male,34,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,male,395,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


In [25]:
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df.head()

Unnamed: 0_level_0,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
10386,male,346,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,male,322,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,male,336,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,male,34,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,male,395,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


In [26]:
# Lets check the missing values
df.isnull().sum()

Gender                                 0
Age                                    0
Location                               1
Verification                           0
Sexual_orientation                     1
Sexual_polarity                        1
Looking_for                            0
Points_Rank                            0
Last_login                             0
Member_since                           0
Number_of_Comments_in_public_forum     0
Time_spent_chating_H:M                 0
Number_of_advertisments_posted         0
Number_of_offline_meetings_attended    0
Profile_pictures                       0
Friends_ID_list                        0
Risk                                   0
dtype: int64

### Insert new Binary column named 'Female'

In [31]:
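# insert(loc, column, value) places the new boolean column at position loc;
# re-running the cell with the same column name raises a ValueError
df.insert(5, 'Female5', df['Gender'] == 'female')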
df.head()

Unnamed: 0_level_0,Female,Female1,Gender,Age,Location,Female5,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
10386,False,False,male,346,A,False,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,False,False,male,322,J,False,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,False,False,male,336,K,False,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,False,False,male,34,H,False,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,False,False,male,395,B,False,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


### Missing values in `Location`

In [28]:
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])

### Decimal points in `Age`

We replace all commas (European decimal separator) with periods, while handling some unformatted values.

In [32]:
def comma_replace(obj):
    return obj.replace(",", ".")

df['Age'].head().apply(comma_replace)  # Series.apply calls the function on each value

User_ID
10386    34.6
14       32.2
16721    33.6
16957      34
17125    39.5
Name: Age, dtype: object

In [33]:
# The same replacement in a single line, using a lambda
df['Age'] = df['Age'].apply(lambda obj: obj.replace(',', '.'))
df.head()

Unnamed: 0_level_0,Female,Female1,Gender,Age,Location,Female5,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
10386,False,False,male,34.6,A,False,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,False,False,male,32.2,J,False,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,False,False,male,33.6,K,False,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,False,False,male,34.0,H,False,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,False,False,male,39.5,B,False,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


In [34]:
# Error: converting Age to numeric fails because some entries are not numbers
pd.to_numeric(df['Age'])

ValueError: ignored

In [35]:
# Method 1: replace the '???' placeholder with NaN, then cast to float
df['Age'] = df['Age'].replace('???', np.nan)
df['Age'] = df['Age'].astype(float)

In [36]:
# Method 2: let to_numeric turn every unparseable value into NaN
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
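The two steps applied to `Age` (decimal comma to period, then numeric conversion) can be seen end to end on a few hypothetical raw values:

```python
import pandas as pd

# Hypothetical raw values: decimal comma, plain integer, placeholder
ages = pd.Series(['34,6', '34', '???'])

as_text = ages.str.replace(',', '.', regex=False)
as_number = pd.to_numeric(as_text, errors='coerce')  # '???' becomes NaN

print(as_number)
```

`errors='coerce'` is convenient when the set of bad values is not known in advance, but it hides them all behind NaN, so it is worth counting the missing values afterwards.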

In [37]:
# Lets check the missing values
df.isnull().sum()

Female                                 0
Female1                                0
Gender                                 0
Age                                    6
Location                               0
Female5                                0
Verification                           0
Sexual_orientation                     1
Sexual_polarity                        1
Looking_for                            0
Points_Rank                            0
Last_login                             0
Member_since                           0
Number_of_Comments_in_public_forum     0
Time_spent_chating_H:M                 0
Number_of_advertisments_posted         0
Number_of_offline_meetings_attended    0
Profile_pictures                       0
Friends_ID_list                        0
Risk                                   0
dtype: int64

In [38]:
# Fill the missing ages with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())


In [39]:
df.isnull().sum()

Female                                 0
Female1                                0
Gender                                 0
Age                                    0
Location                               0
Female5                                0
Verification                           0
Sexual_orientation                     1
Sexual_polarity                        1
Looking_for                            0
Points_Rank                            0
Last_login                             0
Member_since                           0
Number_of_Comments_in_public_forum     0
Time_spent_chating_H:M                 0
Number_of_advertisments_posted         0
Number_of_offline_meetings_attended    0
Profile_pictures                       0
Friends_ID_list                        0
Risk                                   0
dtype: int64

### Convert `Verification` to binary data

We replace each entry with a boolean indicating whether the user is verified.

In [None]:
df['Verification'] = df['Verification'] != 'Non_Verified'
df[['Verification']].head()
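If a model expects numbers rather than booleans, the comparison result can be cast to integers. This is a small sketch on hypothetical labels; only `Non_Verified` appears in the data shown above, so the `Verified` label is an assumption.

```python
import pandas as pd

# Hypothetical labels; 'Verified' is assumed for illustration
verification = pd.Series(['Non_Verified', 'Verified', 'Non_Verified'])

is_verified = verification != 'Non_Verified'   # boolean Series
as_int = is_verified.astype(int)               # 1 for verified, 0 otherwise

print(as_int)
```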