# Data Preparation

## After completing the material in this notebook, you should be able to:

* Explain the concept and purpose of data cleaning
* List possible solutions for handling missing data
* Explain the role and perform basic methods for data reduction
* Define and handle inconsistent data
* Discuss the importance and process of attribute reduction

## We will examine data cleaning in four parts:
1. handling missing data,
2. reducing data (observations),
3. handling inconsistent data,
4. and reducing attributes.

In [5]:
import pandas as pd
internet_data_set = pd.read_csv('files/Internet_Dataset.csv')
internet_data_set

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,N,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,99,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,-1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace


    Depending on your objective in data mining, you may choose to leave missing data as they are, or you may wish to replace missing data with some other value.

### Replacing missing values
    Check for missing data.
    isnull is a DataFrame method that returns a Boolean mask: True wherever a value is missing (null), False elsewhere.

In [6]:
is_null = internet_data_set.isnull()
is_null 

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
6,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False
7,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
9,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False


    Reduce the mask with any (or all) to see which columns contain missing values.

In [7]:
internet_data_set.isnull().any()

Gender                     False
Race                       False
Birth_Year                 False
Marital_Status             False
Years_on_Internet          False
Hours_Per_Day              False
Preferred_Browser          False
Preferred_Search_Engine    False
Preferred_Email            False
Read_News                   True
Online_Shopping             True
Online_Gaming               True
Facebook                   False
Twitter                    False
Other_Social_Network        True
dtype: bool
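
As a small illustration of how `any` and `all` differ on a null mask, the sketch below uses a hypothetical three-row frame (the values are made up and only borrow column names from the survey). `any` flags a column with at least one missing value, `all` flags a column that is entirely missing, and `sum` counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; it is not the real Internet_Dataset.csv.
toy = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Online_Gaming": ["N", np.nan, "Y"],
    "Other_Social_Network": [np.nan, np.nan, np.nan],
})

mask = toy.isnull()

any_missing = mask.any()   # True where a column has at least one missing value
all_missing = mask.all()   # True where every value in the column is missing
n_missing = mask.sum()     # number of missing values per column

print(any_missing)
print(all_missing)
print(n_missing)
```

Counting with `sum` is often more informative than `any`, because it shows how much of each column would have to be filled or dropped.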

In [8]:
# NumPy approach: reduce the underlying Boolean array over the rows (axis=0)
is_null_values = internet_data_set.isnull().values
is_null_values.any(axis=0)

array([False, False, False, False, False, False, False, False, False,
        True,  True,  True, False, False,  True])

In [12]:
# reloading data
internet_data_set = pd.read_csv('files/Internet_Dataset.csv')
print(f'Are there any null values in the dataset? {internet_data_set.isnull().values.any()}')

Are there any null values in the dataset? True


### Replace NaN with the mode of each attribute

In [13]:
internet_data_set.mode()

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1977.0,D,6,2.0,Firefox,Google,Yahoo,Y,Y,N,Y,N,LinkedIn
1,,,1981.0,S,8,,,,,,,,,,
2,,,,,12,,,,,,,,,,
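
The mode table above has more than one row because some columns have ties: `mode` returns every most-frequent value, in sorted order, and pads the other columns with NaN. The sketch below reproduces this on a hypothetical Series (the numbers are made up) and shows why the later cells take element `[0]` when a single fill value is needed.

```python
import pandas as pd

# Hypothetical values: 6, 8 and 12 each appear twice, so all three are modes.
years = pd.Series([8, 6, 12, 8, 6, 12, 2])

modes = years.mode()      # all tied modes, sorted ascending
first_mode = modes[0]     # the smallest of the tied modes

print(modes.tolist())
print(first_mode)
```

Taking `[0]` is a convention, not the only reasonable choice; when modes tie, picking the smallest one is arbitrary and may deserve a second look for numeric attributes.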


In [14]:
# inspect Online_Gaming before filling its NaNs with 'N'
df = pd.read_csv('files/Internet_Dataset.csv')
df['Online_Gaming']

0       N
1       N
2     NaN
3       N
4       N
5       Y
6     NaN
7     NaN
8       N
9       Y
10      N
Name: Online_Gaming, dtype: object

In [15]:
df['Online_Gaming'].fillna('N')

0     N
1     N
2     N
3     N
4     N
5     Y
6     N
7     N
8     N
9     Y
10    N
Name: Online_Gaming, dtype: object

In [16]:
# assign the result back: fillna(..., inplace=True) on a selected column is
# chained assignment and may not update df in recent pandas versions
df['Online_Gaming'] = df['Online_Gaming'].fillna('N')
df['Online_Gaming'] 

0     N
1     N
2     N
3     N
4     N
5     Y
6     N
7     N
8     N
9     Y
10    N
Name: Online_Gaming, dtype: object

In [None]:
# automatically: use the column's own mode as the fill value
df = pd.read_csv('files/Internet_Dataset.csv')
df['Online_Gaming'] = df['Online_Gaming'].fillna(df['Online_Gaming'].mode()[0])
df['Online_Gaming']

In [22]:
df.columns

Index(['Gender', 'Race', 'Birth_Year', 'Marital_Status', 'Years_on_Internet',
       'Hours_Per_Day', 'Preferred_Browser', 'Preferred_Search_Engine',
       'Preferred_Email', 'Read_News', 'Online_Shopping', 'Online_Gaming',
       'Facebook', 'Twitter', 'Other_Social_Network'],
      dtype='object')

In [33]:
# for all columns
df = pd.read_csv('files/Internet_Dataset.csv')
for column in df.columns:
    df[column] = df[column].fillna(df[column].mode()[0])
df

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,LinkedIn
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,LinkedIn
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,N,Y,N,LinkedIn
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,LinkedIn
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,LinkedIn
5,M,African American,N,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,LinkedIn
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,Y,Y,N,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,Y,N,Y,99,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,LinkedIn
9,M,White,1987,S,12,-1,Safari,Yahoo,Yahoo,Y,Y,Y,Y,N,MySpace
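
The per-column loop can also be written as a single call, since `fillna` accepts a Series of fill values keyed by column name. This is a minimal sketch on a hypothetical two-column frame, not the notebook's dataset; `fill_values` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Read_News": ["Y", np.nan, "Y", "N"],
    "Online_Gaming": ["N", "N", np.nan, "Y"],
})

# First row of the mode table: one fill value per column.
fill_values = toy.mode().iloc[0]

# fillna matches the Series index against the column names.
filled = toy.fillna(fill_values)
print(filled)
```

Both forms should give the same result here; the loop is easier to adapt when different columns need different strategies.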


### Drop NaN observations
    In some cases, where the mode or another summary statistic is not a good representative of the missing values, one can simply drop the observations that contain them.

In [35]:
df = pd.read_csv('files/Internet_Dataset.csv')
df

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,N,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,99,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,-1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace


In [36]:
df.dropna(subset = ['Other_Social_Network'], inplace=True)
df

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,99,LinkedIn
9,M,White,1987,S,12,-1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace
10,F,Hispanic,1959,D,12,5,Chrome,Google,Gmail,Y,N,N,Y,N,Google+
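
Dropping on a single column, as above, is only one option. The sketch below compares three `dropna` variants on a hypothetical frame (made-up values): dropping any row with a missing value, keeping rows with at least a given number of non-null values via `thresh`, and restricting the check to chosen columns via `subset`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Read_News": ["Y", np.nan, "Y", np.nan],
    "Online_Gaming": ["N", np.nan, np.nan, "Y"],
    "Facebook": ["Y", "N", "Y", "Y"],
})

complete_rows = toy.dropna()                    # keep rows with no missing value at all
mostly_complete = toy.dropna(thresh=2)          # keep rows with at least 2 non-null values
news_known = toy.dropna(subset=["Read_News"])   # only Read_News must be present

print(complete_rows.index.tolist())
print(mostly_complete.index.tolist())
print(news_known.index.tolist())
```

Each call returns a new frame and leaves `toy` unchanged, which makes it easy to compare how many observations each rule would cost before committing to one.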


### Inconsistent Data
    Inconsistent data occur when a value exists but is not valid or meaningful, for example -1 in Hours_Per_Day or 99 in Twitter.

### Check for inconsistent data

    One way is to check the columns' dtypes.

In [37]:
# reloading 
internet_dataset = pd.read_csv('files/Internet_Dataset.csv')

In [38]:
internet_dataset.dtypes

Gender                     object
Race                       object
Birth_Year                 object
Marital_Status             object
Years_on_Internet           int64
Hours_Per_Day               int64
Preferred_Browser          object
Preferred_Search_Engine    object
Preferred_Email            object
Read_News                  object
Online_Shopping            object
Online_Gaming              object
Facebook                   object
Twitter                    object
Other_Social_Network       object
dtype: object
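
`Birth_Year` is reported as `object` even though it should hold years, which hints that at least one entry is not a number. One way to locate such entries, sketched here on a hypothetical Series with the same kind of problem, is to convert with `pd.to_numeric(..., errors="coerce")` and look at the values that turned into NaN.

```python
import pandas as pd

# Hypothetical sample with one non-numeric entry, as in the Birth_Year column.
birth_year = pd.Series(["1972", "1981", "N", "1969"], name="Birth_Year")

# Unparseable strings become NaN instead of raising an error.
as_number = pd.to_numeric(birth_year, errors="coerce")

# Rows that had a value originally but could not be parsed as a number.
bad_rows = birth_year[as_number.isna() & birth_year.notna()]
print(bad_rows)
```

The `notna()` condition keeps genuinely missing values out of the result, so only inconsistent entries are reported.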

    A second way is to check the unique values of each column.

In [39]:
for col in internet_dataset.columns:
    print(f'{col :30s} >>>>      {internet_dataset[col].unique()}')

Gender                         >>>>      ['M' 'F']
Race                           >>>>      ['White' 'Hispanic' 'African American']
Birth_Year                     >>>>      ['1972' '1981' '1977' '1961' '1954' 'N' '1969' '1987' '1959']
Marital_Status                 >>>>      ['M' 'S' 'D']
Years_on_Internet              >>>>      [ 8 14  6  2 15 11  3 12]
Hours_Per_Day                  >>>>      [ 1  2  6  3  4 -1  5]
Preferred_Browser              >>>>      ['Firefox' 'Chrome' 'Internet Explorer' 'Safari']
Preferred_Search_Engine        >>>>      ['Google' 'Yahoo' 'Bing']
Preferred_Email                >>>>      ['Yahoo' 'Hotmail' 'Gmail']
Read_News                      >>>>      ['Y' 'N' nan]
Online_Shopping                >>>>      ['N' 'Y' nan]
Online_Gaming                  >>>>      ['N' nan 'Y']
Facebook                       >>>>      ['Y' 'N']
Twitter                        >>>>      ['N' 'Y' '99']
Other_Social_Network           >>>>      [nan 'LinkedIn' 'MySpace' 'Google+']


    Summary statistics (describe) and info can also help.

In [40]:
internet_dataset.describe()

Unnamed: 0,Years_on_Internet,Hours_Per_Day
count,11.0,11.0
mean,8.818182,2.636364
std,4.331701,1.911687
min,2.0,-1.0
25%,6.0,2.0
50%,8.0,2.0
75%,12.0,3.5
max,15.0,6.0
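
The `min` of -1.0 for `Hours_Per_Day` is a value that cannot be valid. A range check makes this kind of problem explicit. The sketch below assumes that valid values lie between 0 and 24 hours, which is an assumption made here for illustration and is not stated in the dataset.

```python
import pandas as pd

# Hypothetical sample of the Hours_Per_Day column.
hours = pd.Series([1, 2, 6, -1, 5], name="Hours_Per_Day")

# between() is inclusive on both ends by default; ~ inverts the mask.
out_of_range = hours[~hours.between(0, 24)]
print(out_of_range)
```

The resulting index can then be used to replace or drop the offending observations, in the same way as for the `Twitter` column.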


In [41]:
internet_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Gender                   11 non-null     object
 1   Race                     11 non-null     object
 2   Birth_Year               11 non-null     object
 3   Marital_Status           11 non-null     object
 4   Years_on_Internet        11 non-null     int64 
 5   Hours_Per_Day            11 non-null     int64 
 6   Preferred_Browser        11 non-null     object
 7   Preferred_Search_Engine  11 non-null     object
 8   Preferred_Email          11 non-null     object
 9   Read_News                10 non-null     object
 10  Online_Shopping          9 non-null      object
 11  Online_Gaming            8 non-null      object
 12  Facebook                 11 non-null     object
 13  Twitter                  11 non-null     object
 14  Other_Social_Network     4 non-null      object

In [42]:
internet_dataset = pd.read_csv('files/Internet_Dataset.csv')
internet_dataset['Twitter'] == '99'

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
Name: Twitter, dtype: bool

In [43]:
# Let's find them
internet_dataset['Twitter'][internet_dataset['Twitter'] == '99'].index

Int64Index([7], dtype='int64')

In [44]:
internet_dataset = pd.read_csv('files/Internet_Dataset.csv')
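# assign the result back instead of calling replace(..., inplace=True) on a selected column
internet_dataset['Twitter'] = internet_dataset['Twitter'].replace('99', 'N')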
internet_dataset

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,N,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,N,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,-1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace


In [45]:
# we can drop them too
internet_dataset = pd.read_csv('files/Internet_Dataset.csv')
# build the mask from the same frame that is being modified
index = internet_dataset.index[internet_dataset['Twitter'] == '99']
internet_dataset.drop(index=index, inplace=True)
internet_dataset

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,N,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,-1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace
10,F,Hispanic,1959,D,12,5,Chrome,Google,Gmail,Y,N,N,Y,N,Google+


In [55]:
# assigning values using `at`
column = "Twitter"
internet_dataset = pd.read_csv('files/Internet_Dataset.csv')
consistent_values = ["Y", "N"]
# collect the distinct values that are not in the allowed list
inconsistent_values = list({value for value in internet_dataset[column].values if value not in consistent_values})
inconsistent_values

['99']

In [56]:
# replace each inconsistent entry with the column's mode, one cell at a time
mode = internet_dataset[column].mode()[0]
for inconsistent_value in inconsistent_values:
    indexes = internet_dataset.index[internet_dataset[column] == inconsistent_value]
    for index in indexes:
        internet_dataset.at[index, column] = mode
internet_dataset

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,N,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,N,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,-1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace
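
The `at` loop works cell by cell. A vectorised alternative, sketched here on a hypothetical column, builds one Boolean mask with `isin` and assigns through `.loc`. Unlike the list-based check, this version deliberately leaves missing values alone, so that NaN is handled as missing data, not as an inconsistency. The mode is also computed from the valid entries only, which is a design choice of this sketch.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({"Twitter": ["N", "Y", "99", "N", np.nan]})
valid = ["Y", "N"]

# Present but not one of the allowed values.
invalid = toy["Twitter"].notna() & ~toy["Twitter"].isin(valid)

# Mode of the remaining entries (NaN is ignored by mode).
mode_value = toy.loc[~invalid, "Twitter"].mode()[0]

# One assignment updates every inconsistent cell.
toy.loc[invalid, "Twitter"] = mode_value
print(toy)
```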


### Attribute Reduction 

In [57]:
internet_data_set = pd.read_csv('files/Internet_Dataset.csv')
internet_data_set.drop(['Other_Social_Network'], axis=1, inplace=True)
internet_data_set

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N
5,M,African American,N,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,99
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N
9,M,White,1987,S,12,-1,Safari,Yahoo,Yahoo,Y,,Y,Y,N
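
`Other_Social_Network` is a natural candidate for removal because most of its values are missing. One possible way to make that decision systematic, sketched on a hypothetical frame, is to compute the fraction of missing values per column and drop the columns above a cut-off. The 0.5 threshold is an arbitrary assumption for illustration, not a rule from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Facebook": ["Y", "N", "Y", "Y"],
    "Online_Gaming": ["N", np.nan, "N", "Y"],
    "Other_Social_Network": [np.nan, np.nan, "LinkedIn", np.nan],
})

# The mean of a Boolean mask is the fraction of True values per column.
missing_fraction = toy.isnull().mean()

threshold = 0.5  # hypothetical cut-off
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

Missingness is only one criterion; an attribute may also be dropped because it is irrelevant to the question being asked.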


*:)*