# Data Mining

## Data Preparation

### After completing the materials in this notebook, you should be able to:

* Explain the concept and purpose of data scrubbing
* List possible solutions for handling missing data
* Explain the role of data reduction and perform basic reduction methods
* Define and handle inconsistent data
* Discuss the importance and process of attribute reduction

We will examine data scrubbing in four different ways: handling missing data, reducing
data (observations), handling inconsistent data, and reducing attributes.

In [14]:
import pandas as pd
internet_dataSet = pd.read_csv('Internet_DataSet.csv')
internet_dataSet

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,1982,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,99,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace


In [15]:
# the comma is already the default delimiter; it is passed explicitly here for illustration
internet_dataSet = pd.read_csv('Internet_DataSet.csv', delimiter=',')
internet_dataSet

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,1982,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,99,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace


Depending on your objective in data mining, you may choose to leave missing data as they are, or you may wish to replace missing data with some other value.

#### Pandas tries to find the best dtype for each column (Series)
The dtype object comes from NumPy and describes the type of each element in an ndarray. Every element in an ndarray must have the same size in bytes: int64 and float64 values, for example, take 8 bytes each. Strings have no fixed length, so instead of storing the string bytes directly in the ndarray, pandas uses an object ndarray that stores pointers to Python objects. That is why such columns report the dtype object.

In [16]:
internet_dataSet.dtypes

Gender                     object
Race                       object
Birth_Year                  int64
Marital_Status             object
Years_on_Internet           int64
Hours_Per_Day               int64
Preferred_Browser          object
Preferred_Search_Engine    object
Preferred_Email            object
Read_News                  object
Online_Shopping            object
Online_Gaming              object
Facebook                   object
Twitter                    object
Other_Social_Network       object
dtype: object
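
The difference between a fixed-width numeric column and an object column can be sketched on a tiny stand-in for two of the columns. The values and variable names below are illustrative assumptions, not read from the CSV file, and the dtypes are set explicitly so the result does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical stand-in data; the dtypes are set explicitly for illustration.
birth_year = pd.Series([1972, 1981, 1977], dtype="int64")
gender = pd.Series(["M", "M", "F"], dtype=object)

# Fixed-width numeric column: every element occupies the same number of bytes.
bytes_per_year = birth_year.to_numpy().itemsize

# Object column: the array holds references to ordinary Python str objects.
first_gender_type = type(gender.iloc[0])

print(birth_year.dtype, bytes_per_year)
print(gender.dtype, first_gender_type)
```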

In [17]:
# how to convert dtype:
column = internet_dataSet['Birth_Year']
print(f'column type is {column.dtype}')
print(f'first items type is {type(column[0])}')
converted_column = column.astype('str')
print(f'converted_column type is {type(converted_column[0])}')
print(f'converted_column type is {converted_column.dtype}')

column type is int64
first items type is <class 'numpy.int64'>
converted_column type is <class 'str'>
converted_column type is object
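
Going the other way, from strings to numbers, `astype` raises an error as soon as one entry cannot be parsed. A minimal sketch of a more forgiving conversion with `pd.to_numeric` follows; the sample values are made up for illustration and are not part of the data set.

```python
import pandas as pd

# Hypothetical column of strings, one of which is not a number.
raw = pd.Series(["1972", "1981", "unknown"], dtype=object)

# errors="coerce" turns unparseable entries into NaN instead of raising.
years = pd.to_numeric(raw, errors="coerce")

print(years.dtype)         # the NaN forces a floating-point column
print(years.isna().sum())  # number of entries that could not be parsed
```

The entries that become NaN can then be treated like any other missing value.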


### Check for inconsistent data

In [18]:
for col in internet_dataSet.columns:
    print(f'{col :30s} >>>>      {internet_dataSet[col].unique()}')

Gender                         >>>>      ['M' 'F']
Race                           >>>>      ['White' 'Hispanic' 'African American']
Birth_Year                     >>>>      [1972 1981 1977 1961 1954 1982 1969 1987 1959]
Marital_Status                 >>>>      ['M' 'S' 'D']
Years_on_Internet              >>>>      [ 8 14  6  2 15 11  3 12]
Hours_Per_Day                  >>>>      [1 2 6 3 4 5]
Preferred_Browser              >>>>      ['Firefox' 'Chrome' 'Internet Explorer' 'Safari']
Preferred_Search_Engine        >>>>      ['Google' 'Yahoo' 'Bing']
Preferred_Email                >>>>      ['Yahoo' 'Hotmail' 'Gmail']
Read_News                      >>>>      ['Y' 'N' nan]
Online_Shopping                >>>>      ['N' 'Y' nan]
Online_Gaming                  >>>>      ['N' nan 'Y']
Facebook                       >>>>      ['Y' 'N']
Twitter                        >>>>      ['N' 'Y' '99']
Other_Social_Network           >>>>      [nan 'LinkedIn' 'MySpace' 'Google+']
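
Besides listing the unique values, counting how often each value occurs can make an odd entry such as '99' stand out. This is a small sketch on a made-up stand-in for the Twitter column, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the Twitter column.
twitter = pd.Series(["N", "N", "N", "Y", "N", "99", "N"], dtype=object)

# dropna=False would also report missing values as their own category.
counts = twitter.value_counts(dropna=False)
print(counts)
```

A value that appears only once in a Y/N column is a good candidate for closer inspection.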


### Replacing missing values

#### Check for Missing data


In [19]:
internet_dataSet.isnull()

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
6,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False
7,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
9,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False


In [20]:
internet_dataSet = pd.read_csv('Internet_DataSet.csv')
print(f'Is there any null value in dataset?? {internet_dataSet.isnull().values.any()}')
internet_dataSet[internet_dataSet.isnull().any(axis = 1)]

Is there any null value in dataset?? True


Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,1982,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,99,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace
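
A per-column count of missing values is often easier to read than the full Boolean table. The sketch below uses a small made-up frame with three of the column names; the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
df = pd.DataFrame({
    "Read_News": ["Y", np.nan, "N", "Y"],
    "Online_Shopping": ["N", "Y", np.nan, np.nan],
    "Facebook": ["Y", "Y", "N", "Y"],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Only the observations that have at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```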


Replace NaN with the mode (the most frequent value) of each column.

In [21]:
internet_dataSet.mode()

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1977.0,D,6,2.0,Firefox,Google,Yahoo,Y,Y,N,Y,N,LinkedIn
1,,,1981.0,S,8,,,,,,,,,,
2,,,,,12,,,,,,,,,,
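
Because `mode()` returns one row per tied value, a single fill value per column has to be picked before it can be handed to `fillna`. One possible sketch, assuming the first mode is acceptable and using a small made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; values are for illustration only.
df = pd.DataFrame({
    "Online_Gaming": ["N", np.nan, "N", "Y"],
    "Hours_Per_Day": [2.0, 2.0, np.nan, 3.0],
})

# mode() returns a DataFrame (ties add extra rows); take the first mode of each column.
column_modes = df.mode().iloc[0]

# fillna matches the Series labels against the column names.
filled = df.fillna(column_modes)
print(filled)
```

Taking the first row is an arbitrary tie-break; when several values are equally frequent, another choice may be more appropriate.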


In [26]:
internet_dataSet[['Years_on_Internet','Hours_Per_Day']]

Unnamed: 0,Years_on_Internet,Hours_Per_Day
0,8,1
1,14,2
2,6,2
3,8,6
4,2,3
5,15,4
6,11,2
7,3,3
8,6,2
9,12,1


In [24]:
internet_dataSet

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,1982,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,99,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace


In [32]:
internet_dataSet = pd.read_csv('Internet_DataSet.csv')
# copy so that the original DataFrame is left untouched
missed_replaced_internet_dataSet = internet_dataSet.copy()

In [36]:
missed_replaced_internet_dataSet['Online_Gaming'] = missed_replaced_internet_dataSet['Online_Gaming'].fillna('N')

In [37]:
missed_replaced_internet_dataSet['Online_Gaming']

0     N
1     N
2     N
3     N
4     N
5     Y
6     N
7     N
8     N
9     Y
10    N
Name: Online_Gaming, dtype: object

In [27]:
internet_dataSet = pd.read_csv('Internet_DataSet.csv')
# copy so that the original DataFrame is left untouched
missed_replaced_internet_dataSet = internet_dataSet.copy()

# assign the result back instead of calling fillna(inplace=True) on a column selection
missed_replaced_internet_dataSet['Online_Gaming'] = missed_replaced_internet_dataSet['Online_Gaming'].fillna('N')

# alternative: fill with the most frequent value; mode() returns a Series, so take its first entry
# missed_replaced_internet_dataSet['Online_Gaming'] = missed_replaced_internet_dataSet['Online_Gaming'].fillna(missed_replaced_internet_dataSet['Online_Gaming'].mode()[0])
missed_replaced_internet_dataSet

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,N,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,1982,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,N,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,N,Y,99,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace


### Filtering
When attributes are numeric, such as age or the number of visits to a certain place, a measure of central tendency (mean, median, or mode) may be an acceptable replacement for missing values. For more subjective attributes, such as whether someone shops online, you may be better off simply filtering out the observations where the datum is missing.

In [40]:
# dropna returns a new DataFrame, so the source DataFrame is not modified
filterd_missed_replaced_internet_dataSet = missed_replaced_internet_dataSet.dropna(subset = ['Other_Social_Network'])
filterd_missed_replaced_internet_dataSet

Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,N,Y,Y,LinkedIn
10,F,Hispanic,1959,D,12,5,Chrome,Google,Gmail,Y,N,N,Y,N,Google+
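
The two strategies described in this section, replacing a missing numeric value with a central tendency and filtering out observations with a missing subjective value, can be combined. The following is only a sketch on a made-up frame; the choice of the median is an assumption, and the mean or mode could be used instead.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; values are for illustration only.
df = pd.DataFrame({
    "Hours_Per_Day": [1.0, np.nan, 3.0, 6.0],
    "Online_Shopping": ["Y", "N", np.nan, "Y"],
})

prepared = df.copy()

# Numeric attribute: replace the missing value with the column median.
median_hours = prepared["Hours_Per_Day"].median()
prepared["Hours_Per_Day"] = prepared["Hours_Per_Day"].fillna(median_hours)

# Subjective attribute: drop observations where the answer is missing.
prepared = prepared.dropna(subset=["Online_Shopping"])
print(prepared)
```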


### Sampling

In [None]:
filterd_missed_replaced_internet_dataSet.sample(n = 3)

In [None]:
filterd_missed_replaced_internet_dataSet.sample(n = 3, random_state = 1)
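
The effect of `random_state` can be checked directly: two calls with the same seed return the same observations, which makes a sample reproducible. The frame below is a hypothetical stand-in with ten observations; `frac` is shown as an alternative to `n` for sampling a proportion of the rows.

```python
import pandas as pd

# Hypothetical frame with ten observations.
df = pd.DataFrame({"respondent": range(10)})

# Same seed, same sample.
sample_a = df.sample(n=3, random_state=1)
sample_b = df.sample(n=3, random_state=1)
print(sample_a.equals(sample_b))

# frac samples a proportion of the observations instead of a fixed count.
half = df.sample(frac=0.5, random_state=1)
print(len(half))
```

Note that `n` must not exceed the number of rows unless `replace=True` is passed.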

### Inconsistent Data
Inconsistent data occurs when a value exists but is not valid or meaningful. In this data set, for example, the Twitter column should contain only Y or N, yet one observation holds 99.

In [47]:
internet_dataSet = pd.read_csv('Internet_DataSet.csv')

In [44]:
internet_dataSet['Twitter'] == '99'

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
Name: Twitter, dtype: bool

In [48]:
internet_dataSet['Twitter'][internet_dataSet['Twitter'] == '99']

7    99
Name: Twitter, dtype: object

In [45]:
internet_dataSet = pd.read_csv('Internet_DataSet.csv')
# Use .loc with a Boolean mask. Chained indexing such as
# internet_dataSet['Twitter'][mask] = 'N' triggers SettingWithCopyWarning
# and may fail to modify the original DataFrame.
internet_dataSet.loc[internet_dataSet['Twitter'] == '99', 'Twitter'] = 'N'
internet_dataSet


Unnamed: 0,Gender,Race,Birth_Year,Marital_Status,Years_on_Internet,Hours_Per_Day,Preferred_Browser,Preferred_Search_Engine,Preferred_Email,Read_News,Online_Shopping,Online_Gaming,Facebook,Twitter,Other_Social_Network
0,M,White,1972,M,8,1,Firefox,Google,Yahoo,Y,N,N,Y,N,
1,M,Hispanic,1981,S,14,2,Chrome,Google,Hotmail,Y,N,N,Y,N,
2,F,African American,1977,S,6,2,Firefox,Yahoo,Yahoo,Y,Y,,Y,N,
3,F,White,1961,D,8,6,Firefox,Google,Hotmail,N,Y,N,N,Y,
4,M,White,1954,M,2,3,Internet Explorer,Bing,Hotmail,Y,Y,N,Y,N,
5,M,African American,1982,D,15,4,Internet Explorer,Google,Yahoo,Y,N,Y,N,N,
6,M,African American,1981,D,11,2,Firefox,Google,Yahoo,,Y,,Y,Y,LinkedIn
7,M,White,1977,S,3,3,Internet Explorer,Yahoo,Yahoo,Y,,,Y,N,LinkedIn
8,F,African American,1969,M,6,2,Firefox,Google,Gmail,N,Y,N,N,N,
9,M,White,1987,S,12,1,Safari,Yahoo,Yahoo,Y,,Y,Y,N,MySpace


In [None]:
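# we can also drop the observation instead of replacing the value
internet_dataSet = pd.read_csv('Internet_DataSet.csv')
index = internet_dataSet.index[internet_dataSet.Twitter == '99']
# passing the whole index drops every matching row, not only the first one
internet_dataSet.drop(index=index, inplace=True)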
internet_dataSet

In [None]:
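# what if there is more than one numeric code?
internet_dataSet = pd.read_csv('Internet_DataSet.csv')
for index_, item in internet_dataSet['Twitter'].items():
    try:
        int(item)  # succeeds only for numeric codes such as '99'
    except (TypeError, ValueError):
        continue   # valid answers such as 'Y' and 'N' are left untouched
    # .loc avoids the chained assignment internet_dataSet.Twitter[index_] = 'N'
    internet_dataSet.loc[index_, 'Twitter'] = 'N'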
internet_dataSet
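
The loop above can also be written without iterating, by checking each value against the set of answers considered valid. This is a sketch on a made-up stand-in for the Twitter column; treating only 'Y' and 'N' as valid, and replacing everything else with 'N', are assumptions carried over from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the Twitter column with two invalid codes.
twitter = pd.Series(["N", "Y", "99", "N", "7"], dtype=object)

# Assumption: only these answers are valid for a yes/no attribute.
allowed = ["Y", "N"]

# True wherever the value is not one of the allowed answers.
invalid = ~twitter.isin(allowed)

# Replace the invalid entries with 'N' and keep everything else.
cleaned = twitter.mask(invalid, "N")
print(cleaned.tolist())
```

Missing values would also be flagged as invalid by this check, so they may need to be handled separately first.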

### Attribute Reduction 

In [54]:
internet_dataSet = pd.read_csv('Internet_DataSet.csv')
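# note: axis=0 drops rows (here the observations labeled 0 and 1);
# to drop attributes, pass column names with axis=1 or columns=[...]
internet_dataSet.drop([0,1],axis=0, inplace=True)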

In [52]:
# how to keep only some columns (raises KeyError if those columns were dropped earlier)
internet_dataSet[['Twitter','Preferred_Search_Engine']]

KeyError: "['Twitter' 'Preferred_Search_Engine'] not in index"
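
Attribute reduction means removing columns rather than rows. A minimal sketch on a made-up three-column frame shows both directions, dropping an attribute and keeping only a chosen subset, and reproduces the kind of KeyError that appears when a dropped column is selected afterwards.

```python
import pandas as pd

# Hypothetical frame; values are for illustration only.
df = pd.DataFrame({
    "Gender": ["M", "F"],
    "Twitter": ["N", "Y"],
    "Preferred_Search_Engine": ["Google", "Yahoo"],
})

# Drop an attribute (column); drop returns a new DataFrame by default.
reduced = df.drop(columns=["Twitter"])

# Keep only the listed attributes.
kept = df[["Twitter", "Preferred_Search_Engine"]]

# Selecting a column that has already been dropped raises KeyError.
try:
    reduced[["Twitter"]]
    raised = False
except KeyError:
    raised = True

print(list(reduced.columns), list(kept.columns), raised)
```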

*:)*