## Preprocessing for Machine Learning in Python

### CHAPTER 1. Introduction to Data Preprocessing

#### 1.1 Introduction to preprocessing

* It comes after exploratory data analysis and data cleaning
* It is a process of preparing data for modeling
* Examples:
    * transforming categorical features into numerical features (dummy variables)
    * removing missing data
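
A minimal sketch of the two examples above, using a made-up toy DataFrame (the column names `color` and `size` are hypothetical and not part of the volunteer dataset). It assumes `dropna()` and `pd.get_dummies()` are acceptable ways to do each step; other approaches exist.

```python
import pandas as pd

# Hypothetical toy data; column names are invented for illustration
df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],
    "size": [1.0, 2.5, 3.0, None],
})

# Remove rows that contain any missing value
clean = df.dropna()

# Turn the categorical 'color' column into dummy (indicator) columns
encoded = pd.get_dummies(clean, columns=["color"])

print(encoded)
```

Only the two complete rows survive `dropna()`, and `color` is replaced by one indicator column per remaining category.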

Why preprocess?
* Transform dataset so it's suitable for modeling
* Improve model performance
* Generate more reliable results

Recap: exploring data with pandas
* preview the first rows with *.head()*
* check column data types and non-null counts with *.info()*
* check summary statistics with *.describe()*
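
A small sketch of the three calls on a toy DataFrame (the data here is hypothetical and only stands in for a real dataset). Note that `info()` prints its report directly and returns `None`, while `describe()` returns a DataFrame that covers numeric columns by default.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "hits": [737, 22, 62, 14, 31],
    "title": ["a", "b", "c", "d", "e"],
})

print(df.head(3))        # first three rows
df.info()                # prints column names, non-null counts, dtypes
summary = df.describe()  # count, mean, std, min, quartiles, max
print(summary)
```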


In [7]:
# exploring missing data
import pandas as pd

volunteer = pd.read_csv('8_datasets/volunteer.csv')
print(volunteer.head())
print(volunteer.isna().sum())
print(volunteer.shape)

   opportunity_id  content_id  vol_requests  event_time  \
0            4996       37004            50           0   
1            5008       37036             2           0   
2            5016       37143            20           0   
3            5022       37237           500           0   
4            5055       37425            15           0   

                                               title  hits  \
0  Volunteers Needed For Rise Up & Stay Put! Home...   737   
1                                       Web designer    22   
2      Urban Adventures - Ice Skating at Lasker Rink    62   
3  Fight global hunger and support women farmers ...    14   
4                                      Stop 'N' Swap    31   

                                             summary is_priority  category_id  \
0  Building on successful events last summer and ...         NaN          NaN   
1             Build a website for an Afghan business         NaN          1.0   
2  Please join us and the stu

In [9]:
# dropping missing data
# drop 'Latitude' and 'Longitude' columns
volunteer_cols = volunteer.drop(['Latitude', 'Longitude'], axis=1)

# drop rows where 'category_desc' is missing
volunteer_subset = volunteer_cols.dropna(subset=['category_desc'])

print(volunteer_subset.shape)

(617, 33)
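
To make the effect of each step easier to see, here is the same pattern on a tiny made-up DataFrame (the columns `lat`, `category` and `hits` are hypothetical). It is a sketch of the idea, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names are invented for illustration
df = pd.DataFrame({
    "lat": [np.nan, np.nan, np.nan],
    "category": ["a", None, "b"],
    "hits": [1, 2, 3],
})

# Drop a whole column (axis=1 means columns)
no_lat = df.drop(["lat"], axis=1)

# Keep only rows where 'category' is present
subset = no_lat.dropna(subset=["category"])

print(df.shape, no_lat.shape, subset.shape)
```

Dropping a column reduces the second number in the shape, and `dropna(subset=...)` reduces the first.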


#### 1.2 Working with data types

Common data types:
* 'object': string/mixed types
* 'int64': integer
* 'float64': float
* 'datetime64': dates and times
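
A quick sketch showing how these types appear on a toy DataFrame (all column names are hypothetical). The exact label printed for string columns and the datetime resolution may vary with the pandas version, so treat the printed names as indicative.

```python
import pandas as pd

# Hypothetical toy data with one column per common type
df = pd.DataFrame({
    "name": ["a", "b"],                                    # strings
    "count": [1, 2],                                       # integers
    "price": [1.5, 2.0],                                   # floats
    "when": pd.to_datetime(["2020-01-01", "2020-06-15"]),  # dates
})

# One dtype per column
print(df.dtypes)
```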

Converting column types:
* use the *.astype()* method to convert a column to a specified data type
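
A minimal sketch of a typical use case: numbers that were read in as strings. The toy column below is hypothetical. `astype()` returns a new Series, so the result is assigned back to the column.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as strings
df = pd.DataFrame({"hits": ["737", "22", "62"]})

# Convert the column to 64-bit integers and assign it back
df["hits"] = df["hits"].astype("int64")

print(df["hits"].dtype)
print(df["hits"].sum())  # numeric sum now works
```

If a value cannot be parsed as an integer, `astype()` raises an error instead of silently producing missing values.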

In [11]:
# exploring data types
import pandas as pd
volunteer = pd.read_csv('8_datasets/volunteer.csv')
volunteer.info()  # info() prints its report directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   opportunity_id      665 non-null    int64  
 1   content_id          665 non-null    int64  
 2   vol_requests        665 non-null    int64  
 3   event_time          665 non-null    int64  
 4   title               665 non-null    object 
 5   hits                665 non-null    int64  
 6   summary             665 non-null    object 
 7   is_priority         62 non-null     object 
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object 
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object 
 13  org_content_id      665 non-null    int64  
 14  addresses_count     665 non-null    int64  
 15  locality            595 non-null    object 
 16  region  

In [14]:
# convert the 'hits' column to int64
print(volunteer['hits'].head())
volunteer['hits'] = volunteer['hits'].astype('int64')
# a Series has a single dtype, so use .dtype
print(volunteer['hits'].dtype)

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int32
int64


#### 1.3 Training and test sets

*