### DATA PREPROCESSING
---
<b>Data preprocessing is an important step in the data life cycle. It is applied to raw data
before the data is used for data mining, data analytics, or machine learning.</b>
 
There are many techniques and methodologies for preprocessing data.

A common way to organise them is the *CIRT* formula:
 
C = Data <b>Cleaning</b>: techniques that ‘clean’ the data by removing outliers, replacing missing
values, smoothing noisy data, and correcting inconsistent data.

I = Data <b>Integration</b>: data from multiple sources and with different representations is integrated,
and conflicts within the data are resolved.

R = Data <b>Reduction</b>: a condensed representation of the dataset is created that is smaller in volume
while maintaining the integrity of the original.

T = Data <b>Transformation</b>: the data is transformed into a form appropriate for data modeling; it
is normalized, aggregated, and generalized.

 ![image.png](attachment:image.png)
 
 Fig 1: A snapshot from Wei Jie (Professor of Computing at University of West London) 
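
The four CIRT steps can be sketched on a toy example. This is only an illustration under stated assumptions: the two small tables, their column names (`customer_id`, `amount`, `city`) and the choice of median imputation and min-max scaling are hypothetical and are not part of the datasets used later in this notebook.

```python
import numpy as np
import pandas as pd

# Two hypothetical sources describing the same customers
sales = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "amount": [120.0, np.nan, 95.0, 110.0],
})
profiles = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "city": ["London", "Leeds", "London", "Leeds"],
})

# C: cleaning, replace the missing amount with the column median
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# I: integration, combine both sources on their shared key
merged = sales.merge(profiles, on="customer_id", how="inner")

# R: reduction, condense the data to one row per city
reduced = merged.groupby("city", as_index=False)["amount"].mean()

# T: transformation, rescale the aggregated amounts to the range [0, 1]
amount = reduced["amount"]
reduced["amount_scaled"] = (amount - amount.min()) / (amount.max() - amount.min())

print(reduced)
```

In practice the order and the specific technique at each step depend on the dataset; the sketch only shows how the four letters map onto ordinary pandas operations.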
    


##### 1. Data Cleaning

In [1]:
# Display every expression's output in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # lets Jupyter Notebook show all interactive output
                                                 # without resorting to print

import warnings
warnings.filterwarnings('ignore')  # suppress warnings to keep the notebook output short
import pandas as pd
import numpy as np
from numpy import random
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KN


In [2]:
# Load the dataset before handling its missing values
df1 = pd.read_csv('volunteer_opportunities.csv')
df1.head(5) # Preview the first five rows

df1.info() # Summarise the column types and non-null counts



Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   opportunity_id      665 non-null    int64  
 1   content_id          665 non-null    int64  
 2   vol_requests        665 non-null    int64  
 3   event_time          665 non-null    int64  
 4   title               665 non-null    object 
 5   hits                665 non-null    int64  
 6   summary             665 non-null    object 
 7   is_priority         62 non-null     object 
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object 
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object 
 13  org_content_id      665 non-null    int64  
 14  addresses_count     665 non-null    int64  
 15  locality            595 non-null    object 
 16  region  


![image.png](attachment:image.png)
##### Fig 2: From W3Schools. This gives a broader understanding of how to use `dropna`
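
As a small illustration of the `dropna` parameters, the sketch below applies `how` and `thresh` column-wise to a toy frame. The frame and its column names are hypothetical. Note that `thresh` counts the non-null values a column needs in order to be kept, not the number of missing values that triggers removal.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4],                      # 4 non-null values
    "b": [1.0, np.nan, 3.0, 4.0],           # 3 non-null values
    "c": [np.nan, np.nan, np.nan, 4.0],     # 1 non-null value
    "d": [np.nan] * 4,                      # 0 non-null values
})

any_null = toy.dropna(axis=1, how="any")   # drop columns containing any NaN
all_null = toy.dropna(axis=1, how="all")   # drop columns that are entirely NaN
at_least_3 = toy.dropna(axis=1, thresh=3)  # keep columns with >= 3 non-null values

print(list(any_null.columns))    # ['a']
print(list(all_null.columns))    # ['a', 'b', 'c']
print(list(at_least_3.columns))  # ['a', 'b']
```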

In [3]:
# Keep only the columns that have at least 3 non-null values
# (thresh is the minimum number of non-null values required to keep a column)
df_1 = df1.dropna(axis=1, thresh=3)

# The dataframe initially contained 665 rows and 35 columns

# After removing the mostly empty columns we have

print("\nAfter removing columns:", df_1.shape)
df_1.head(10)


After removing columns: (665, 24)


Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,region,postalcode,display_url,recurrence_type,hours,created_date,last_modified_date,start_date_date,end_date_date,status
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,NY,,/opportunities/4996,onetime,0,January 13 2011,June 23 2011,July 30 2011,July 30 2011,approved
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,NY,10010.0,/opportunities/5008,onetime,0,January 14 2011,January 25 2011,February 01 2011,February 01 2011,approved
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,NY,10026.0,/opportunities/5016,onetime,0,January 19 2011,January 21 2011,January 29 2011,January 29 2011,approved
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,NY,2114.0,/opportunities/5022,ongoing,0,January 21 2011,January 25 2011,February 14 2011,March 31 2012,approved
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,NY,10455.0,/opportunities/5055,onetime,0,January 28 2011,February 01 2011,February 05 2011,February 05 2011,approved
5,5056,37426,15,0,Queens Stop 'N' Swap,135,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,NY,11372.0,/opportunities/5056,onetime,0,January 28 2011,January 28 2011,February 12 2011,February 12 2011,approved
6,5053,37406,2,0,Staff Development Trainer,156,The Jewish Museum seeks a volunteer staff deve...,,1.0,Strengthening Communities,...,NY,10128.0,/opportunities/5053,ongoing,0,January 27 2011,January 28 2011,January 27 2011,January 27 2012,approved
7,5085,37652,20,0,CLARO Brooklyn Volunteer Attorney,4407,"Provide legal advice, information and resource...",y,2.0,Helping Neighbors in Need,...,NY,11201.0,/opportunities/5085,ongoing,0,February 07 2011,August 11 2011,February 07 2011,February 07 2012,approved
8,5091,37730,100,0,Join Cents Ability!,325,Join the Cents Ability volunteer corps to teac...,,,,...,NY,10020.0,/opportunities/5091,ongoing,0,February 09 2011,February 09 2011,February 08 2011,February 08 2012,approved
9,5093,37741,1,0,Volunteer-Community Health Advocates,928,Community Health Advocates (CHA);formerly NYC ...,,5.0,Health,...,NY,10010.0,/opportunities/5093,ongoing,0,February 09 2011,February 11 2011,February 14 2011,February 09 2012,approved


#### Note: Understanding the features and the use case of a dataset is important, because it often informs which cleaning technique to apply

In [4]:
# Removing rows with missing values
# This dataset has a column called "category_desc" (category description)
# We can use the NaN status of this column to decide which rows to remove,
# i.e. if this column must have a value, we remove the rows whose category description is null

# Check how many values are missing in this column
print(df_1['category_desc'].isnull().sum())

48


In [5]:
# Keep only the rows that have a non-null "category_desc" value
df_2 = df_1[df_1['category_desc'].notnull()]
print(df_2.shape)

(617, 24)
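
The boolean mask used above has a `dropna` equivalent: passing the column name through `subset`. The sketch below checks this on a three-row toy frame whose values are borrowed from the preview output; the frame itself is only an illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Web designer", "Stop 'N' Swap", "Join Cents Ability!"],
    "category_desc": ["Strengthening Communities", "Environment", np.nan],
})

by_mask = toy[toy["category_desc"].notnull()]     # boolean-mask filtering
by_dropna = toy.dropna(subset=["category_desc"])  # same rows via dropna

print(by_mask.equals(by_dropna))  # True
```

Either form works; `subset` tends to read more clearly when several columns must all be non-null.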


In [7]:
# Explore the data types before converting any of them
print(df_2.dtypes)
df_2.head(5)

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
dtype: object


Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,region,postalcode,display_url,recurrence_type,hours,created_date,last_modified_date,start_date_date,end_date_date,status
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,NY,10010.0,/opportunities/5008,onetime,0,January 14 2011,January 25 2011,February 01 2011,February 01 2011,approved
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,NY,10026.0,/opportunities/5016,onetime,0,January 19 2011,January 21 2011,January 29 2011,January 29 2011,approved
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,NY,2114.0,/opportunities/5022,ongoing,0,January 21 2011,January 25 2011,February 14 2011,March 31 2012,approved
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,NY,10455.0,/opportunities/5055,onetime,0,January 28 2011,February 01 2011,February 05 2011,February 05 2011,approved
5,5056,37426,15,0,Queens Stop 'N' Swap,135,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,NY,11372.0,/opportunities/5056,onetime,0,January 28 2011,January 28 2011,February 12 2011,February 12 2011,approved


In [8]:
# Suppose we want to convert the data type of postalcode and category_id from float to int.
# A float column that contains NaN cannot be cast to int, so the missing values
# must be filled first, using either df['column'].fillna(df['column'].mean())
# or df['column'].fillna(0), depending on what the column contains.
# .copy() makes df_2 an independent frame, so the assignments below do not write to a slice of df_1
df_2 = df_2.copy()
df_2[['postalcode','category_id']] = df_2[['postalcode','category_id']].fillna(0)
df_2[['postalcode','category_id']] = df_2[['postalcode', 'category_id']].astype(int)
print(df_2.dtypes)



opportunity_id         int64
content_id             int64
vol_requests           int64
event_time             int64
title                 object
hits                   int64
summary               object
is_priority           object
category_id            int32
category_desc         object
org_title             object
org_content_id         int64
addresses_count        int64
locality              object
region                object
postalcode             int32
display_url           object
recurrence_type       object
hours                  int64
created_date          object
last_modified_date    object
start_date_date       object
end_date_date         object
status                object
dtype: object
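
Filling with 0 turns a missing postal code into the number 0, which can later be mistaken for real data. As a possible alternative (not the approach used in this notebook), pandas offers a nullable integer dtype, `"Int64"`, that keeps the value missing. The three sample values below are taken from the `postalcode` column shown earlier.

```python
import numpy as np
import pandas as pd

postal = pd.Series([10010.0, np.nan, 2114.0], name="postalcode")

filled = postal.fillna(0).astype(int)   # missing value becomes the number 0
nullable = postal.astype("Int64")       # missing value stays missing (<NA>)

print(filled.tolist())        # [10010, 0, 2114]
print(nullable.isna().sum())  # 1
print(nullable.dtype)         # Int64
```

Which option is appropriate depends on the downstream use: some libraries expect plain NumPy integer columns and may not accept the nullable dtype.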


##### 2. Data Transformation

Feature scaling
<br>
An important aspect of data transformation is feature scaling. The common methods of feature scaling include:
• Normalization:
<li> log normalization: rescaling a feature by applying a log transformation to its
values, which reduces the variance of the data.</li>
<li> min-max normalization: rescaling a feature to a fixed range, typically [0, 1]</li>

![image-4.png](attachment:image-4.png)

<li> mean normalization: rescaling a feature by subtracting its mean and dividing by its range (max minus min), so the values are centred on 0</li>

![image-3.png](attachment:image-3.png)

• Standardization:
<li> Z-score normalization: transforming a feature to have a mean of 0 and a variance of 1</li>

![image-2.png](attachment:image-2.png)
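
The four scaling methods listed above can be sketched in NumPy as follows. This assumes the formulas in the images are the standard textbook definitions, and it uses the first five `Proline` values from the wine dataset as sample input. `x.std()` here is the population standard deviation (`ddof=0`), whereas pandas' `.var()` and `.std()` default to the sample versions (`ddof=1`).

```python
import numpy as np

x = np.array([1065.0, 1050.0, 1185.0, 1480.0, 735.0])  # first five 'Proline' values

log_scaled = np.log(x)                             # log normalization
min_max = (x - x.min()) / (x.max() - x.min())      # min-max: values in [0, 1]
mean_norm = (x - x.mean()) / (x.max() - x.min())   # mean normalization: centred on 0
z_score = (x - x.mean()) / x.std()                 # z-score: mean 0, variance 1

# Compare the mean and variance of each rescaled version
for name, values in [("log", log_scaled), ("min-max", min_max),
                     ("mean", mean_norm), ("z-score", z_score)]:
    print(f"{name:8s} mean={values.mean():.3f} var={values.var():.3f}")
```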


When to do feature scaling
<li> The model operates in a linear space (for example, k-nearest neighbors)</li>
<li> Dataset features have high variance</li>
<li> Dataset features are continuous and on different scales</li>
<li> The model makes linearity assumptions</li>

In [9]:
# Case study: testing the downside of skipping feature scaling
## ----- Without feature scaling

wine = pd.read_csv('wine_types.csv')
wine.head()

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [10]:
# Note that one of the columns, 'Proline', has an extremely high variance
# compared to the other columns
wine.describe()
random.seed(13)
X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
y = wine['Type']

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


In [11]:
# Instantiate the k-nearest neighbors classifier
knn = KN()
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)
# Score the model on the test data
print("\nAccuracy score:", knn.score(X_test, y_test))
print("The result indicates a low accuracy score because of the high variance of the 'Proline' feature")
# The accuracy is low because the high-variance 'Proline' feature
# dominates the distances that k-nearest neighbors relies on


Accuracy score: 0.6
The result indicates a low accuracy score because of the high variance of the 'Proline' feature
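
To see why an unscaled `Proline` column hurts k-nearest neighbors, it helps to look at how much each feature contributes to the Euclidean distance between two samples (the default metric of `KNeighborsClassifier`). The sketch below uses the first two rows of `wine.head()` for the four selected features.

```python
import numpy as np

# First two rows of wine.head(): Proline, Total phenols, Hue, Nonflavanoid phenols
row0 = np.array([1065.0, 2.80, 1.04, 0.28])
row1 = np.array([1050.0, 2.65, 1.05, 0.26])

squared_diff = (row0 - row1) ** 2
share = squared_diff / squared_diff.sum()  # each feature's share of the squared distance

print("Euclidean distance:", np.sqrt(squared_diff.sum()))
print("Proline share of squared distance:", share[0])
```

For this pair, `Proline` accounts for more than 99% of the squared distance, so the other three features barely influence which neighbors are chosen.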


In [12]:
### Applying log Normalization 
wine.head(3)

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185


In [13]:
wine['Proline'].head()

0    1065
1    1050
2    1185
3    1480
4     735
Name: Proline, dtype: int64

In [19]:
# Checking the variance of the Proline column before normalization
print("The variance before log normalisation:",  wine['Proline'].var())

The variance before log normalisation: 99166.71735542428


In [20]:
wine['Proline_log'] = np.log( wine['Proline'])

In [22]:
# Checking the variance of the Proline column after normalization
print("The variance after log normalisation:",  wine['Proline_log'].var())

The variance after log normalisation: 0.17231366191842018


In [23]:
# Checking the available columns
wine.head(5)

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline,Proline_log
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,6.97073
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050,6.956545
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185,7.077498
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480,7.299797
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735,6.59987


In [24]:
X = wine[['Proline_log', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
y = wine['Type']

In [31]:
# Instantiate the k-nearest neighbors classifier
kn = KN()
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the k-nearest neighbors model to the training data
kn.fit(X_train, y_train)
# Score the model on the test data
print("\nAccuracy score:", round(kn.score(X_test, y_test), 2))
print("The accuracy is better than the initial output")


Accuracy score: 0.82
The accuracy is better than the initial output
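
The two accuracy scores above come from two different random train/test splits, so part of the difference may be due to the split rather than the transformation. A fairer comparison fixes the split for both runs. The sketch below is one way to do that, with assumptions: it uses scikit-learn's bundled wine data (`load_wine`) as a stand-in for `wine_types.csv`, so the column names are lowercase, and the helper `knn_accuracy` and the seed value are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled wine data stands in for wine_types.csv
data = load_wine(as_frame=True)
cols = ["proline", "total_phenols", "hue", "nonflavanoid_phenols"]
X_raw = data.data[cols]
y = data.target

# Same features, with only 'proline' log-transformed
X_log = X_raw.copy()
X_log["proline"] = np.log(X_log["proline"])

def knn_accuracy(X, y, seed=13):
    """Fit KNN on a fixed, stratified train/test split and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed, stratify=y)
    return KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)

acc_raw = knn_accuracy(X_raw, y)
acc_log = knn_accuracy(X_log, y)
print("Raw features:", round(acc_raw, 2))
print("Log-transformed Proline:", round(acc_log, 2))
```

Because both runs share the same `random_state`, any difference between the two scores reflects the transformation alone. Repeating the comparison over several seeds would give a more stable picture.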
