# Topic: Data Collection & Preprocessing

### Agenda
    1) Handling Missing Values
    2) Splitting & Standardizing
    3) Labelling & Encoding
    4) Handling Imbalanced Dataset
    5) Feature Extraction of Text Data
    6) Numerical Data Preprocessing Use Case (Diabetes Dataset)

### Importing Necessary Libraries

In [1]:
import numpy as np
import pandas as pd

import sklearn.datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Loading Dataset from sklearn library

In [2]:
data = sklearn.datasets.load_breast_cancer()

In [3]:
df = pd.DataFrame(data.data, columns=data.feature_names)

In [4]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [5]:
df.shape

(569, 30)

# Topic-1 Handling Missing Values

* There are two common ways to handle missing values:
    1. Imputation - Fill missing entries with the mean or median for numerical columns and the mode for categorical columns.
    2. Dropping - Remove the rows (or columns) that contain missing values. This is usually not preferred because it discards data.

In [6]:
df.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64

#### Since there are no missing values, we proceed to the next step. Otherwise, we would impute using the median rather than the mean, because the mean is sensitive to outliers.
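
The breast cancer data has nothing to impute, so the following is only a minimal sketch on a made-up toy frame (the column names `income` and `city` and all values are hypothetical). It shows why the median is the safer fill value when an outlier is present, and what dropping rows costs.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numeric column with an outlier, one categorical column.
toy = pd.DataFrame({
    'income': [30.0, 32.0, np.nan, 31.0, 500.0],   # 500.0 is an outlier
    'city':   ['A', 'B', 'A', None, 'A'],
})

mean_value = toy['income'].mean()      # pulled upward by the outlier
median_value = toy['income'].median()  # barely affected by the outlier

# Imputation: median for the numeric column, mode for the categorical column.
imputed = toy.copy()
imputed['income'] = imputed['income'].fillna(median_value)
imputed['city'] = imputed['city'].fillna(imputed['city'].mode()[0])

# Dropping: every row with at least one missing value is discarded.
dropped = toy.dropna()

print(mean_value, median_value)
print(imputed)
print(dropped.shape)
```

Here the mean of the observed incomes is far from the typical value, while the median stays close to it, and dropping removes two of the five rows.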

# Topic-2 Splitting & Standardizing

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [8]:
X = df
Y = data.target

#### Splitting

In [9]:
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=100)

In [10]:
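data.data.std()
# This is the standard deviation over all values of all features combined.
# The features are on very different scales, so we standardize each column (mean 0, std 1) with StandardScaler.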

228.29740508276657

#### Using StandardScaler for standardizing

In [11]:
scaler = StandardScaler()

In [12]:
scaler.fit(x_train)

StandardScaler()

In [13]:
x_train_standardized = scaler.transform(x_train)

In [14]:
x_test_standardized = scaler.transform(x_test)

In [15]:
x_train_standardized.std()

1.0

In [16]:
x_test_standardized.std()

1.1949838575568676
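
Calling `.std()` on the whole array mixes all columns together. A per-column check (`axis=0`) is arguably a clearer way to confirm what `StandardScaler` did. The sketch below repeats the split with the same `test_size` and `random_state` as above; the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features, target = load_breast_cancer(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=100
)

check_scaler = StandardScaler().fit(x_tr)   # statistics come from the training rows only
x_tr_std = check_scaler.transform(x_tr)
x_te_std = check_scaler.transform(x_te)

# Per-column statistics of the training data: mean ~0 and std ~1 for every feature.
train_means_ok = np.allclose(x_tr_std.mean(axis=0), 0.0)
train_stds_ok = np.allclose(x_tr_std.std(axis=0), 1.0)
print(train_means_ok, train_stds_ok)

# The test columns are scaled with the training statistics, so their
# means and standard deviations are only approximately 0 and 1.
print(x_te_std.mean(axis=0).round(2))
print(x_te_std.std(axis=0).round(2))
```

A test-set standard deviation that differs from 1 is therefore expected and does not indicate a mistake.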

# Topic-3 Labelling & Encoding

In [17]:
from sklearn.preprocessing import LabelEncoder

In [18]:
df1 = pd.read_csv('data.csv')

In [19]:
df1.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [20]:
df1['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

In [21]:
label_encoder = LabelEncoder()

In [22]:
labels = label_encoder.fit_transform(df1['diagnosis'])

In [23]:
df1['target'] = labels

In [24]:
df1['target'].value_counts()

0    357
1    212
Name: target, dtype: int64
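
To see which integer each class received, the fitted encoder can be inspected. This is a small sketch on a toy list of labels rather than on `data.csv`; `LabelEncoder` assigns codes in sorted order of the class names, so `B` becomes 0 and `M` becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

diagnosis = ['M', 'B', 'B', 'M', 'B']   # toy labels, same classes as the diagnosis column

encoder = LabelEncoder()
codes = encoder.fit_transform(diagnosis)

# classes_ lists the original labels; the position of a label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = encoder.inverse_transform(codes)
print(list(codes), list(restored))
```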

In [25]:
df1.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32,target
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,,1
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,,1
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,,1
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,,1
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,,1


# Topic-4 Handling Imbalanced Dataset

In [26]:
df1['target'].value_counts()
# 0 means benign (B)
# 1 means malignant (M)

0    357
1    212
Name: target, dtype: int64

#### From the counts above, labels '0' and '1' are only moderately imbalanced (357 vs. 212). In general, the data may be highly imbalanced.

In [27]:
# Separate the dataset by target value (0 and 1)
target0 = df1[df1['target'] == 0]
target1 = df1[df1['target'] == 1]

In [28]:
print(target0.shape)
print(target1.shape)

(357, 34)
(212, 34)


### Implementing Under-Sampling

In [29]:
# Randomly sample the majority class (target 0) down to the size of the minority class.
# target1 has 212 rows
target0_sample = target0.sample(n=212)

#### Concatenate two data frames

In [30]:
new_df = pd.concat([target0_sample, target1],axis=0)

In [31]:
new_df['target'].value_counts()

0    212
1    212
Name: target, dtype: int64

#### Here targets '0' and '1' have the same number of rows, so this is a balanced dataset.
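
The under-sampling steps above can be wrapped in a small reusable function. This is a sketch under assumptions: the helper name `undersample` and the toy frame are hypothetical, and the `random_state` argument is added so that the random sample is reproducible between runs.

```python
import pandas as pd

def undersample(frame, target_col, random_state=0):
    """Down-sample every class to the size of the smallest class."""
    smallest = frame[target_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(target_col)
    ]
    balanced = pd.concat(parts, axis=0)
    # Shuffle so that the classes are not stored in consecutive blocks.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy frame: 7 rows of class 0 and 3 rows of class 1.
toy = pd.DataFrame({'feature': range(10), 'target': [0] * 7 + [1] * 3})
balanced = undersample(toy, 'target')
print(balanced['target'].value_counts())
```

Under-sampling throws away majority-class rows, so it is best suited to cases where the majority class has plenty of data to spare.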

# Topic-5 Feature Extraction of Text Data

In [32]:
# Mapping text data to real-valued vectors is known as feature extraction.
# Converting text data to numerical data

__Bag of Words:__ Represents each document by the counts of the words it contains, using the list of unique words in the text corpus as the vocabulary.

__Term Frequency-Inverse Document Frequency (TF-IDF):__ Weights each word by how often it appears in a document, scaled down for words that appear in many documents.
 * We use the __TF-IDF Vectorizer__.
     * __Term Frequency (TF):__ (No. of times the term appears in the document)/(No. of terms in the document)
     * __Inverse Document Frequency (IDF):__ log(N/n), where N = total no. of documents and n = no. of documents in which the term appears.
         The IDF of a rare word is high; the IDF of a frequent word is low.
     * __TF-IDF value__ = TF * IDF
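
The formula can be checked by hand on a tiny corpus. One caveat: with its default settings, scikit-learn's `TfidfVectorizer` uses raw term counts as TF, a smoothed IDF of ln((1 + N)/(1 + n)) + 1, and then scales each document vector to unit L2 length, so its numbers differ from the plain log(N/n) version. The sketch below reproduces those defaults manually on a made-up three-sentence corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['the cat sat', 'the dog sat', 'the dog barked loudly']  # toy corpus

# Raw term counts: rows are documents, columns are vocabulary terms.
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed IDF as used by scikit-learn's defaults.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF * IDF, then L2-normalize each document row (default norm='l2').
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(corpus).toarray()
matches = np.allclose(tfidf_manual, tfidf_sklearn)
print(matches)
```

Because of the final normalization, dividing the counts by the document length (as in the TF definition above) would not change the result: the constant factor cancels out.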

In [33]:
df2 = pd.DataFrame({'id':[0,1,2,3,4], 
                    'title':['House Dem Aide: We Didn’t Even See Comey’s Let','FLYNN: Hillary Clinton, Big Woman on Campus',
                            'Why the Truth Might Get You Fired','15 Civilians Killed In Single US Airstrike',
                            'Iranian woman jailed for fictional unpublished'],
                    'author':['Darrell Lucus','Daniel J. Flynn','Consortiumnews.com','Jessica Purkiss','Howard Portnoy'],
                   'text':['House Dem Aide: We Didn’t Even See Comey’s','Ever get the feeling your life circles the',
                          'Why the Truth Might Get You Fired October 29','Videos 15 Civilians Killed In Single US',
                          'Iranian woman has been sentenced'],'label':[1,0,1,1,1]})

In [34]:
df2

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus",Daniel J. Flynn,Ever get the feeling your life circles the,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,Why the Truth Might Get You Fired October 29,1
3,3,15 Civilians Killed In Single US Airstrike,Jessica Purkiss,Videos 15 Civilians Killed In Single US,1
4,4,Iranian woman jailed for fictional unpublished,Howard Portnoy,Iranian woman has been sentenced,1


#### Creating a new column by merging author and title

In [35]:
df2['content'] = df2['author'] + ' ' + df2['title']

In [36]:
df2.head(2)

Unnamed: 0,id,title,author,text,label,content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s,1,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus",Daniel J. Flynn,Ever get the feeling your life circles the,0,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."


In [37]:
X = df2['content'].values
Y = df2['label'].values

In [38]:
X

array(['Darrell Lucus House Dem Aide: We Didn’t Even See Comey’s Let',
       'Daniel J. Flynn FLYNN: Hillary Clinton, Big Woman on Campus',
       'Consortiumnews.com Why the Truth Might Get You Fired',
       'Jessica Purkiss 15 Civilians Killed In Single US Airstrike',
       'Howard Portnoy Iranian woman jailed for fictional unpublished'],
      dtype=object)

In [39]:
print(Y)

[1 0 1 1 1]


### Importing library

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

#### Convert textual data to feature vectors

In [41]:
vectorizer = TfidfVectorizer()

In [42]:
vectorizer.fit(X)
X = vectorizer.transform(X)

In [43]:
print(X)

  (0, 40)	0.3015113445777636
  (0, 34)	0.3015113445777636
  (0, 29)	0.3015113445777636
  (0, 28)	0.3015113445777636
  (0, 21)	0.3015113445777636
  (0, 14)	0.3015113445777636
  (0, 13)	0.3015113445777636
  (0, 12)	0.3015113445777636
  (0, 11)	0.3015113445777636
  (0, 8)	0.3015113445777636
  (0, 1)	0.3015113445777636
  (1, 42)	0.24721169864215167
  (1, 31)	0.3064125284733739
  (1, 20)	0.3064125284733739
  (1, 17)	0.6128250569467478
  (1, 10)	0.3064125284733739
  (1, 6)	0.3064125284733739
  (1, 4)	0.3064125284733739
  (1, 3)	0.3064125284733739
  (2, 43)	0.3333333333333333
  (2, 41)	0.3333333333333333
  (2, 37)	0.3333333333333333
  (2, 36)	0.3333333333333333
  (2, 30)	0.3333333333333333
  (2, 19)	0.3333333333333333
  (2, 16)	0.3333333333333333
  (2, 9)	0.3333333333333333
  (2, 7)	0.3333333333333333
  (3, 39)	0.3333333333333333
  (3, 35)	0.3333333333333333
  (3, 33)	0.3333333333333333
  (3, 27)	0.3333333333333333
  (3, 26)	0.3333333333333333
  (3, 23)	0.3333333333333333
  (3, 5)	0.333333333

#### Now these numerical values can be fed into a machine learning model
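
Each printed entry has the form `(document index, vocabulary index)  weight`. To read the column indices as words, the fitted vectorizer's `vocabulary_` attribute (a term-to-index dictionary) can be inverted. A minimal sketch, using two of the titles from `df2` as a stand-in corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['Why the Truth Might Get You Fired',
        '15 Civilians Killed In Single US Airstrike']   # two titles from df2

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)   # fit and transform in one call

# vocabulary_ maps term -> column index; invert it to look up terms by index.
index_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

print(matrix.shape)   # (number of documents, vocabulary size)

# List the non-zero entries of the first document as (term, weight) pairs.
row = matrix[0].tocoo()
first_doc_terms = {index_to_term[int(col)]: float(w) for col, w in zip(row.col, row.data)}
for term, weight in sorted(first_doc_terms.items()):
    print(term, round(weight, 3))
```

Note that the default tokenizer lowercases the text and keeps only tokens of two or more characters.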

# Topic-6 Numerical Data Preprocessing Use Case (Diabetes Dataset)

#### 1. Importing necessary libraries

In [44]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#### 2. Collecting data

In [45]:
df = pd.read_csv('diabetes.csv')

#### 3. Reading data

In [46]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


#### 4. Checking shape of the dataset

In [47]:
df.shape

(768, 9)

#### 5. Checking Statistical measures of the dataset

In [48]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


#### 6. Checking information (Type of column, Null values) of the dataset

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


###### If there are any null values, impute them using either the mean or the median for numerical data
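
Although `df.info()` reports no nulls, `df.describe()` shows a minimum of 0 for columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI. The sketch below treats those zeros as placeholders for unrecorded measurements; this is an assumption, not something the dataset itself states. It rebuilds the five rows from `df.head()` by hand so that it runs without `diabetes.csv`.

```python
import numpy as np
import pandas as pd

# First five rows of the table shown above (subset of columns), as floats.
sample = pd.DataFrame({
    'Glucose':       [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin':       [0, 0, 0, 94, 168],
    'BMI':           [33.6, 26.6, 23.3, 28.1, 43.1],
}).astype(float)

# Assumption: a zero in these columns means "not recorded".
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

missing_counts = cleaned.isnull().sum()
print(missing_counts)

# Impute each column with its own median, computed from the observed values.
cleaned = cleaned.fillna(cleaned.median())
print(cleaned)
```

With only five rows the medians are not meaningful; on the full dataset the same two steps would apply column by column.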

#### 7. Separating feature columns & target column

In [51]:
# Outcome is the target column: 0 means the person is non-diabetic, 1 means the person is diabetic

In [60]:
X = df.drop('Outcome',axis=1)
Y = df['Outcome']

#### Since different columns have different ranges of values (tens, hundreds, decimals), we standardize the data to a common scale.

#### 8. Standardizing the Data

In [62]:
from sklearn.preprocessing import StandardScaler

In [63]:
scaler = StandardScaler()

In [64]:
standardized_data = scaler.fit_transform(X)

In [65]:
standardized_data

array([[ 0.63994726,  0.84832379,  0.14964075, ...,  0.20401277,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
        -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, ..., -1.10325546,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 ,  0.00330087,  0.14964075, ..., -0.73518964,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.1597866 , -0.47073225, ..., -0.24020459,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.8730192 ,  0.04624525, ..., -0.20212881,
        -0.47378505, -0.87137393]])

In [66]:
X = standardized_data

#### 9. Splitting the data

In [67]:
from sklearn.model_selection import train_test_split

In [68]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 10)

In [69]:
print(x_train.shape)

(614, 8)


In [70]:
print(x_test.shape)

(154, 8)


In [71]:
print(y_train.shape)

(614,)


In [72]:
print(y_test.shape)

(154,)
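
In this use case the scaler is fitted on all rows before the split, so the test rows influence the means and standard deviations used for scaling. An alternative order, matching Topic-2, is to split first and fit the scaler on the training rows only. The sketch below uses synthetic data as a stand-in for `X` and `Y` (all values are generated, not taken from `diabetes.csv`), and the `stratify` argument is an optional addition that keeps the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # synthetic stand-in for X
labels = np.array([0] * 130 + [1] * 70)                      # synthetic stand-in for Y

# 1) Split first; stratify keeps the class ratio the same in train and test.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=10, stratify=labels
)

# 2) Fit the scaler on the training rows only, then apply it to both splits.
split_scaler = StandardScaler()
f_train_std = split_scaler.fit_transform(f_train)
f_test_std = split_scaler.transform(f_test)   # reuses the training mean and std

print(f_train_std.mean(axis=0).round(6), f_train_std.std(axis=0).round(6))
print(l_train.mean(), l_test.mean())
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are close to but not exactly at those values, which is the expected behavior.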
