# Data Preprocessing
Data preprocessing is the process of transforming a raw dataset into a cleaner, more useful form, so that it can be represented as numerical values on which calculations can easily be performed.

## Types of Features
Before looking at the various preprocessing techniques, let us review the types of features that a dataset can have. Features are also called attributes or variables.
**There are two types of features:**
#### 1. Categorical
Categorical features take their values from a fixed set of categories. They are of two types.

**_Nominal :_** These cannot be ordered. _Example - Color Names (Brown, Red, Green, Orange)_

**_Ordinal :_** These can be ordered. _Example - Grades (A,B,C,D,F), Size (small, normal, big, bigger)_ 
#### 2. Numerical
Numerical features take numeric values that are counted or measured rather than chosen from a set of labels. They are also of two types.

**_Discrete :_** These take separate, countable values, typically whole numbers.

_Example - 2, 5, 6, 33, 55_

**_Continuous :_** These can take any value within a range, including fractional values.

_Example - 2.3, 4, 5.65, 3.44_
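
As a rough illustration of the four feature types, the sketch below builds a small pandas DataFrame and marks the categorical columns explicitly. The column names and values are invented for this example and are not taken from the Chennai dataset; the category order chosen for `grade` is likewise an assumption.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "color": ["Brown", "Red", "Green", "Red"],  # nominal
    "grade": ["B", "A", "F", "C"],              # ordinal
    "rooms": [2, 5, 6, 3],                      # discrete
    "area": [2.3, 4.0, 5.65, 3.44],             # continuous
})

# Nominal: a categorical column with no order among its categories.
df["color"] = pd.Categorical(df["color"])

# Ordinal: categories listed from lowest to highest, with ordered=True.
df["grade"] = pd.Categorical(
    df["grade"], categories=["F", "D", "C", "B", "A"], ordered=True
)

print(df.dtypes)

# min() and max() are only meaningful for ordered categories.
lowest_grade = df["grade"].min()
highest_grade = df["grade"].max()
print(lowest_grade, highest_grade)
```

Declaring the order up front lets pandas compare and sort ordinal values correctly, whereas a nominal column only supports equality checks.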


## The Dataset 
Not all of the steps described below apply to every problem; which steps are needed depends on the data you are working with.
It is hard to find a single dataset to which every data preprocessing concept can be applied.

The dataset we are going to use is a popular one for Chennai house price prediction. You can find it in the [GitHub repo](https://github.com/Ashuto7h/datascience-mashup/main/House_Price_Prediction/Chennai_house_price_train.csv) or on [Kaggle](https://www.kaggle.com/nishant4k/chennai-house-pricing-).

### Importing the Dataset

In [2]:
import pandas

# Raw CSV file hosted in the GitHub repository
path = r'https://raw.githubusercontent.com/Ashuto7h/datascience-mashup/main/House_Price_Prediction/Chennai_house_price_train.csv'
dataframe = pandas.read_csv(path)
dataframe.head()  # show the first five rows

| | PRT_ID | AREA | INT_SQFT | DATE_SALE | DIST_MAINROAD | N_BEDROOM | N_BATHROOM | N_ROOM | SALE_COND | PARK_FACIL | ... | UTILITY_AVAIL | STREET | MZZONE | QS_ROOMS | QS_BATHROOM | QS_BEDROOM | QS_OVERALL | REG_FEE | COMMIS | SALES_PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P03210 | Karapakkam | 1004 | 04-05-2011 | 131 | 1.0 | 1.0 | 3 | AbNormal | Yes | ... | AllPub | Paved | A | 4.0 | 3.9 | 4.9 | 4.33 | 380000 | 144400 | 7600000 |
| 1 | P09411 | Anna Nagar | 1986 | 19-12-2006 | 26 | 2.0 | 1.0 | 5 | AbNormal | No | ... | AllPub | Gravel | RH | 4.9 | 4.2 | 2.5 | 3.765 | 760122 | 304049 | 21717770 |
| 2 | P01812 | Adyar | 909 | 04-02-2012 | 70 | 1.0 | 1.0 | 3 | AbNormal | Yes | ... | ELO | Gravel | RL | 4.1 | 3.8 | 2.2 | 3.09 | 421094 | 92114 | 13159200 |
| 3 | P05346 | Velachery | 1855 | 13-03-2010 | 14 | 3.0 | 2.0 | 5 | Family | No | ... | NoSewr | Paved | I | 4.7 | 3.9 | 3.6 | 4.01 | 356321 | 77042 | 9630290 |
| 4 | P06210 | Karapakkam | 1226 | 05-10-2009 | 84 | 1.0 | 1.0 | 3 | AbNormal | Yes | ... | AllPub | Gravel | C | 3.0 | 2.5 | 4.1 | 3.29 | 237000 | 74063 | 7406250 |


### Extracting basic information
Using the functions info() and describe(), you can gather a lot of information about the dataset.

info() lists the columns along with the number of non-null values in each and their data types.

describe() gives summary statistics for the numerical features, including the count, mean, standard deviation, min, and max.
It also reports the 25%, 50%, and 75% percentiles (quartiles), which show how the values are distributed: for example, 25% of the values in a column are less than or equal to its 25% figure.

In [3]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7109 entries, 0 to 7108
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PRT_ID         7109 non-null   object 
 1   AREA           7109 non-null   object 
 2   INT_SQFT       7109 non-null   int64  
 3   DATE_SALE      7109 non-null   object 
 4   DIST_MAINROAD  7109 non-null   int64  
 5   N_BEDROOM      7108 non-null   float64
 6   N_BATHROOM     7104 non-null   float64
 7   N_ROOM         7109 non-null   int64  
 8   SALE_COND      7109 non-null   object 
 9   PARK_FACIL     7109 non-null   object 
 10  DATE_BUILD     7109 non-null   object 
 11  BUILDTYPE      7109 non-null   object 
 12  UTILITY_AVAIL  7109 non-null   object 
 13  STREET         7109 non-null   object 
 14  MZZONE         7109 non-null   object 
 15  QS_ROOMS       7109 non-null   float64
 16  QS_BATHROOM    7109 non-null   float64
 17  QS_BEDROOM     7109 non-null   float64
 18  QS_OVERA

In [4]:
dataframe.describe()

| | INT_SQFT | DIST_MAINROAD | N_BEDROOM | N_BATHROOM | N_ROOM | QS_ROOMS | QS_BATHROOM | QS_BEDROOM | QS_OVERALL | REG_FEE | COMMIS | SALES_PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7109.0 | 7109.0 | 7108.0 | 7104.0 | 7109.0 | 7109.0 | 7109.0 | 7109.0 | 7061.0 | 7109.0 | 7109.0 | 7109.0 |
| mean | 1382.073006 | 99.603179 | 1.637029 | 1.21326 | 3.688704 | 3.517471 | 3.507244 | 3.4853 | 3.503254 | 376938.330708 | 141005.726544 | 10894910.0 |
| std | 457.410902 | 57.40311 | 0.802902 | 0.409639 | 1.019099 | 0.891972 | 0.897834 | 0.887266 | 0.527223 | 143070.66201 | 78768.093718 | 3768603.0 |
| min | 500.0 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 71177.0 | 5055.0 | 2156875.0 |
| 25% | 993.0 | 50.0 | 1.0 | 1.0 | 3.0 | 2.7 | 2.7 | 2.7 | 3.13 | 272406.0 | 84219.0 | 8272100.0 |
| 50% | 1373.0 | 99.0 | 1.0 | 1.0 | 4.0 | 3.5 | 3.5 | 3.5 | 3.5 | 349486.0 | 127628.0 | 10335050.0 |
| 75% | 1744.0 | 148.0 | 2.0 | 1.0 | 4.0 | 4.3 | 4.3 | 4.3 | 3.89 | 451562.0 | 184506.0 | 12993900.0 |
| max | 2500.0 | 200.0 | 4.0 | 2.0 | 6.0 | 5.0 | 5.0 | 5.0 | 4.97 | 983922.0 | 495405.0 | 23667340.0 |
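
To make the rows of the describe() output easier to read, here is a minimal sketch on a tiny made-up frame. The column names are borrowed from the dataset, but the values are invented for illustration. It shows that count skips missing entries and that the 25%, 50%, and 75% rows match what quantile() returns.

```python
import pandas as pd

# Tiny invented frame; one N_BATHROOM value is deliberately missing.
toy = pd.DataFrame({
    "INT_SQFT": [500, 1000, 1500, 2000, 2500],
    "N_BATHROOM": [1.0, None, 1.0, 2.0, 1.0],
})

summary = toy.describe()
print(summary)

# count only includes non-missing values, so it is 4 for N_BATHROOM.
print(summary.loc["count"])

# The percentile rows of describe() are the same values quantile() gives.
quartiles = toy["INT_SQFT"].quantile([0.25, 0.5, 0.75])
print(quartiles)
```

A count below the number of rows, as seen for N_BEDROOM, N_BATHROOM, and QS_OVERALL in the table above, is a quick hint that a column has missing values.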


## 1. Handling Missing Values
When we look at the output of info(), we find that the data has some missing values, i.e. some columns have fewer than 7109 non-null values (for example, N_BEDROOM has 7108 and N_BATHROOM has 7104).

Any real-world data can contain missing values. There can be many reasons for this, such as data entry errors or data collection problems caused by a malfunctioning sensor. Many algorithms do not support data with missing values, so it is necessary to identify them and remove them.
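
One possible way to identify and remove missing values with pandas is sketched below. It uses a small invented frame rather than the real dataset (the column names are borrowed, the values are made up), and dropping rows with dropna() is only one option, shown here as an assumption about what "removing" could mean.

```python
import numpy as np
import pandas as pd

# Invented frame with a few missing entries (NaN).
toy = pd.DataFrame({
    "N_BEDROOM": [1.0, 2.0, np.nan, 3.0],
    "N_BATHROOM": [1.0, np.nan, np.nan, 2.0],
    "N_ROOM": [3, 5, 3, 5],
})

# Identify: number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Identify: the rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)

# Remove: drop every row that has a missing value in any column.
cleaned = toy.dropna()
print(cleaned)
```

Dropping rows is simple but discards data, so it is usually reasonable only when few rows are affected, as is the case for the columns listed in the info() output.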


