# Data Preprocessing

* Before modeling, we need to preprocess (clean) the dataset.
* If your dataset has missing values, you need to handle them, for example by removing them.
* Machine learning models require numerical input. If your dataset has categorical variables, you need to transform them.
* Data preprocessing is a prerequisite for modeling.
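
As a minimal sketch of the two steps listed above, the example below removes rows with missing values and then one-hot encodes a categorical column. The toy DataFrame and its column names (`hours`, `category`) are hypothetical and are not part of the volunteer dataset; `pd.get_dummies` is one common way to encode categories, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the volunteer dataset
df = pd.DataFrame({
    "hours": [2.0, np.nan, 5.0, 3.0, 1.0],
    "category": ["Health", "Education", "Education", None, "Health"],
})

# Remove rows that contain any missing value
clean = df.dropna()

# One-hot encode the categorical column into indicator columns
encoded = pd.get_dummies(clean, columns=["category"])

print(clean.shape)
print(list(encoded.columns))
```

Removing rows is the simplest option, but it discards data, so it is worth checking how many rows remain afterwards.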

## 1. Diagnose your dataset

In [39]:
# importing pandas package 
import pandas as pd 
# making data frame from csv file 
volunteer = pd.read_csv("volunteer_opportunities.csv") 

In [2]:
print(volunteer.shape)

(665, 35)


In [3]:
print(volunteer.columns)

Index(['opportunity_id', 'content_id', 'vol_requests', 'event_time', 'title',
       'hits', 'summary', 'is_priority', 'category_id', 'category_desc',
       'amsl', 'amsl_unit', 'org_title', 'org_content_id', 'addresses_count',
       'locality', 'region', 'postalcode', 'primary_loc', 'display_url',
       'recurrence_type', 'hours', 'created_date', 'last_modified_date',
       'start_date_date', 'end_date_date', 'status', 'Latitude', 'Longitude',
       'Community Board', 'Community Council ', 'Census Tract', 'BIN', 'BBL',
       'NTA'],
      dtype='object')


In [4]:
print(volunteer.dtypes)

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

In [5]:
print(volunteer.describe())

       opportunity_id    content_id  vol_requests  event_time         hits  \
count      665.000000    665.000000    665.000000       665.0   665.000000   
mean      5374.454135  42790.643609     78.778947         0.0   345.409023   
std        234.322154   5491.720274    569.763773         0.0   530.716526   
min       4952.000000  36697.000000      1.000000         0.0     0.000000   
25%       5175.000000  38414.000000      3.000000         0.0   102.000000   
50%       5377.000000  40222.000000     12.000000         0.0   204.000000   
75%       5580.000000  49308.000000     30.000000         0.0   374.000000   
max       5782.000000  52894.000000   9999.000000         0.0  4662.000000   

       category_id  amsl  amsl_unit  org_content_id  addresses_count ...   \
count   617.000000   0.0        0.0      665.000000       665.000000 ...    
mean      2.105348   NaN        NaN    20752.207519         1.046617 ...    
std       1.412003   NaN        NaN    19143.034346         0.5371

In [6]:
print(volunteer.head())

   opportunity_id  content_id  vol_requests  event_time  \
0            4996       37004            50           0   
1            5008       37036             2           0   
2            5016       37143            20           0   
3            5022       37237           500           0   
4            5055       37425            15           0   

                                               title  hits  \
0  Volunteers Needed For Rise Up & Stay Put! Home...   737   
1                                       Web designer    22   
2      Urban Adventures - Ice Skating at Lasker Rink    62   
3  Fight global hunger and support women farmers ...    14   
4                                      Stop 'N' Swap    31   

                                             summary is_priority  category_id  \
0  Building on successful events last summer and ...         NaN          NaN   
1             Build a website for an Afghan business         NaN          1.0   
2  Please join us and the stu

In [7]:
print(volunteer.tail())

     opportunity_id  content_id  vol_requests  event_time  \
660            5640       50193             3           0   
661            5218       38711            10           0   
662            5541       47820             1           0   
663            5398       40722             2           0   
664            5507       44303             5           0   

                                                 title  hits  \
660          Volunteer for NYLAG's Food Stamps Project   197   
661    Iridescent Science Studio Open House Volunteers   113   
662                                  French Translator   145   
663                  Marketing & Advertising Volunteer   330   
664  Volunteer filmmakers to help Mayor's Office wi...   304   

                                               summary is_priority  \
660  Volunteers needed to file for fair hearings, d...         NaN   
661  Come out to the South Bronx to help us hold ou...         NaN   
662  Volunteer needed to translate wri

In [8]:
volunteer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
opportunity_id        665 non-null int64
content_id            665 non-null int64
vol_requests          665 non-null int64
event_time            665 non-null int64
title                 665 non-null object
hits                  665 non-null int64
summary               665 non-null object
is_priority           62 non-null object
category_id           617 non-null float64
category_desc         617 non-null object
amsl                  0 non-null float64
amsl_unit             0 non-null float64
org_title             665 non-null object
org_content_id        665 non-null int64
addresses_count       665 non-null int64
locality              595 non-null object
region                665 non-null object
postalcode            659 non-null float64
primary_loc           0 non-null float64
display_url           665 non-null object
recurrence_type       665 non-null object
hours                 

In [13]:
print(volunteer.isnull().sum())

opportunity_id          0
content_id              0
vol_requests            0
event_time              0
title                   0
hits                    0
summary                 0
is_priority           603
category_id            48
category_desc          48
amsl                  665
amsl_unit             665
org_title               0
org_content_id          0
addresses_count         0
locality               70
region                  0
postalcode              6
primary_loc           665
display_url             0
recurrence_type         0
hours                   0
created_date            0
last_modified_date      0
start_date_date         0
end_date_date           0
status                  0
Latitude              665
Longitude             665
Community Board       665
Community Council     665
Census Tract          665
BIN                   665
BBL                   665
NTA                   665
dtype: int64


In [16]:
volunteer["category_desc"]

0                            NaN
1      Strengthening Communities
2      Strengthening Communities
3      Strengthening Communities
4                    Environment
5                    Environment
6      Strengthening Communities
7      Helping Neighbors in Need
8                            NaN
9                         Health
10     Strengthening Communities
11                     Education
12     Strengthening Communities
13                     Education
14     Helping Neighbors in Need
15     Strengthening Communities
16     Helping Neighbors in Need
17     Strengthening Communities
18     Helping Neighbors in Need
19     Strengthening Communities
20                   Environment
21                   Environment
22                        Health
23                        Health
24     Strengthening Communities
25                           NaN
26                           NaN
27                        Health
28                           NaN
29                   Environment
          

## 2. Remove missing data

### Remove missing values with notnull() & boolean filter

In [21]:
# Check how many values are missing in the category_desc column
print(volunteer["category_desc"].isnull().sum())

# Subset the volunteer dataset
boolean_filter = volunteer["category_desc"].notnull()
volunteer_subset = volunteer[boolean_filter]

# Print out the shape of the subset
print(volunteer_subset.shape)
print(volunteer_subset["category_desc"].isnull().sum())

48
(617, 35)
0


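### Drop missing values from a column with dropna()

In [28]:
print(volunteer.shape)
# dropna() on a single column returns a Series without the missing entries.
# To drop the matching rows but keep all columns, use
# volunteer.dropna(subset=["category_desc"]) instead.
volunteer_dropped = volunteer["category_desc"].dropna()
print(volunteer_dropped.shape)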

(665, 35)
(617,)
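
The behaviour of `dropna()` depends on its arguments. The sketch below uses a hypothetical toy frame (the names `toy`, `label`, and `empty` are illustrative only) to compare dropping rows by one column, dropping columns that are entirely missing, and calling `dropna()` with no arguments.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "empty" is entirely missing, "label" is partly missing
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "label": ["a", None, "b", None],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})

# Drop rows only where "label" is missing; all columns are kept
rows_kept = toy.dropna(subset=["label"])

# Drop columns in which every value is missing
cols_kept = toy.dropna(axis=1, how="all")

# A plain dropna() removes every row here, because "empty" is always NaN
all_dropped = toy.dropna()

print(rows_kept.shape)    # (2, 3)
print(cols_kept.shape)    # (4, 2)
print(all_dropped.shape)  # (0, 3)
```

The same caveat applies to the volunteer dataset: several of its columns, such as amsl and primary_loc, contain no values at all, so an unqualified `volunteer.dropna()` would return an empty frame.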


### Drop columns or rows with drop() and axis

In [31]:
# Create a dataset with all columns except category_desc and BBL
volunteer_X = volunteer.drop(['category_desc', 'BBL'], axis=1)
# Create a labels dataset containing only category_desc
volunteer_y = volunteer[['category_desc']]

print(volunteer_X.shape)


(665, 33)


In [40]:
# Drop the rows with index labels 0 and 1 (axis=0 refers to rows)
volunteer_test = volunteer.drop([0, 1], axis=0)
print(volunteer_test.shape)

(663, 35)


## Types in pandas
* object: string/mixed types

* int64: integer

* float64: float

## 3. Converting column types

In [43]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype(int)

# Look at the dtypes of the dataset
print(volunteer.dtypes)

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: object
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                   object
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
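
`astype(int)` raises an error if a column contains a value that cannot be parsed as an integer. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` on a hypothetical toy column (the frame `toy` and the value `"n/a"` are made up for illustration), which converts unparseable entries to NaN instead of failing.

```python
import pandas as pd

# Hypothetical toy column stored as strings, with one unparseable entry
toy = pd.DataFrame({"hits": ["737", "22", "n/a", "14"]})

# astype(int) would raise a ValueError on "n/a";
# pd.to_numeric with errors="coerce" turns unparseable values into NaN
toy["hits_num"] = pd.to_numeric(toy["hits"], errors="coerce")

print(toy.dtypes)
print(toy["hits_num"].isnull().sum())
```

Because NaN is a floating-point value, the converted column ends up as float64 rather than int64 whenever any entry could not be parsed.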

In [61]:
# Format a float to four decimal places
a = 3.1456789
print("{0:.4f}".format(a))

3.1457


## 4. Class distribution

In [57]:
volunteer["category_desc"].value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64

In [58]:
volunteer["category_desc"].count()

617
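
Raw counts can be hard to compare across classes, so it may help to look at relative frequencies as well. The sketch below uses a hypothetical toy Series (`labels`) rather than the volunteer data and relies on the `normalize=True` option of `value_counts()`.

```python
import pandas as pd

# Hypothetical toy labels, not the volunteer data
labels = pd.Series(["A", "A", "A", "B", "B", "C", None])

# Absolute counts per class (missing values are excluded by default)
counts = labels.value_counts()

# Relative frequencies, which sum to 1 over the non-missing labels
shares = labels.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Like `count()`, `value_counts()` ignores missing values by default, which is why the class counts above add up to 617 rather than 665.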

## 5. Stratified sampling
We know that the classes in the category_desc column of the volunteer dataset are unevenly distributed. If we want to train a model to predict category_desc, we should train it on a sample that is representative of the entire dataset. Stratified sampling achieves this by keeping the class proportions the same in the training and test sets.

In [59]:
from sklearn.model_selection import train_test_split  

# Create a dataset with all columns except category_desc
volunteer_X = volunteer.drop(['category_desc'], axis=1)

# Create a labels dataset containing only category_desc
volunteer_y = volunteer[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify = volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64


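## 6. Standardizing data

* Many scikit-learn models assume that features are roughly normally distributed and on comparable scales.
* Standardizing your data addresses this; two common approaches are log normalization and feature scaling.
* Both are applied to continuous numerical data.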
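
Log normalization and feature scaling can be sketched as follows. The toy DataFrame and its columns (`small`, `large`) are hypothetical; `np.log` and scikit-learn's `StandardScaler` are used here as one reasonable choice for each step, not as the only option.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one high-variance column
toy = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0],
    "large": [10.0, 100.0, 1000.0, 10000.0],
})

# Log normalization: compress the high-variance column with the natural log
toy["large_log"] = np.log(toy["large"])

# Feature scaling: rescale each column to mean 0 and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy[["small", "large_log"]]),
    columns=["small", "large_log"],
)

print(toy.var())
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

In a real workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into preprocessing.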

In [28]:
from sklearn.model_selection import train_test_split  

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify = volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())
print(y_test['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
Strengthening Communities    77
Helping Neighbors in Need    30
Education                    23
Health                       13
Environment                   8
Emergency Preparedness        4
Name: category_desc, dtype: int64
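
To check that a stratified split really preserves class proportions, the relative frequencies of the two splits can be compared. The sketch below uses hypothetical imbalanced toy data (`X`, `y`, and the class names are made up), and fixes `test_size` and `random_state` only to make the result repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 "major" rows and 20 "minor" rows
X = pd.DataFrame({"feature": range(100)})
y = pd.Series(["major"] * 80 + ["minor"] * 20, name="label")

# stratify=y keeps the class ratio in both splits; random_state makes it repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Both splits should show the same 80/20 ratio as the full dataset. Without `stratify`, the minority class could be over- or under-represented in the test set purely by chance.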
