**Learning Objectives**

* Identify and handle missing values.
* Data Formatting
* Data Normalization (centering/scaling)
* Data Binning
  * Binning creates bigger categories from a set of numerical values. It is particularly useful for comparing groups of data.
* Turning categorical values into numeric variables.

**Dealing with Missing values in Python**

* Missing values occur when no data value is stored for a variable (feature) in an observation.
* They may be represented as "?", "N/A", 0, or just a blank cell.
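
As a minimal sketch of how such placeholders might be handled, the snippet below converts "?" markers to `NaN` so that pandas recognizes them as missing, then counts the missing values per column. The small DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "?" marks a missing value
df = pd.DataFrame({
    "make": ["audi", "bmw", "?", "volvo"],
    "price": ["13950", "?", "16500", "?"],
})

# Replace the "?" placeholder with NaN, pandas' standard missing-value marker
df = df.replace("?", np.nan)

# Count the missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)
```

Once the placeholders are `NaN`, methods such as `dropna()` and `fillna()` can act on them.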

**How to deal with missing data?**
- Check with the data collection source
- Drop the missing values
  - drop the variable
  - drop the data entry with the missing value.
- Replace the missing values
  - replace it with an average (of similar datapoints)
  - replace it by frequency.
  - replace it based on other functions
- Leave it as missing data.

You deal with missing values for categorical data by:

- replacing the missing value with the mode of the particular column, that is, the value that appears most often in that column.
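
A minimal sketch of the mode-based replacement described above, assuming a hypothetical categorical column named `num-of-doors` with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry
df = pd.DataFrame({"num-of-doors": ["four", "two", "four", np.nan, "four"]})

# mode() returns a Series (ties are possible), so take the first entry
most_common = df["num-of-doors"].mode()[0]

# Fill the missing entries with the most frequent value
df["num-of-doors"] = df["num-of-doors"].fillna(most_common)
print(df["num-of-doors"].tolist())
```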


Dropping missing values:

```python
dataframes.dropna()
```
* `axis=0` drops the entire row that contains a missing value
* `axis=1` drops the entire column that contains a missing value

```python
df.dropna(subset=["price"], axis=0, inplace=True)
```

`inplace=True` applies the modification directly to the DataFrame. Alternatively, assign the result back to the variable:

```python
df= df.dropna(subset=["price"], axis=0)
```

**How to replace missing values in Python**

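Use `dataframe.replace(missing_value, new_value)`:

```python
import numpy as np
mean = df["normalized-losses"].mean()
# replace() returns a new Series, so assign it back to the column
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
```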

## Data Formatting in Python

* Bringing data into a common standard of expression allows users to make meaningful comparisons.

|Non-formatted|Formatted|
|--|--|
|Confusing|Clearer|
|Hard to aggregate|Easy to aggregate|
|Hard to compare|Easy to compare|

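For example, several spellings of the same city can be brought into one standard form:

|City (non-formatted)|City (formatted)|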
|--|--|
|NY|New York|
|New York|New York|
|N.Y|New York|
|N.Y|New York|
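
One way to apply this kind of standardization in pandas is sketched below. It assumes a hypothetical `City` column and a hand-written mapping of known variants; real data may need a longer mapping.

```python
import pandas as pd

# Hypothetical column with inconsistent spellings of the same city
df = pd.DataFrame({"City": ["NY", "New York", "N.Y", "N.Y"]})

# Map each known variant to the standard form; unlisted values are left as-is
df["City"] = df["City"].replace({"NY": "New York", "N.Y": "New York"})
print(df["City"].unique())
```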

**Applying calculations to an entire column**

The car dataset reports fuel consumption in miles per gallon (mpg). If you live in a country that uses metric units, you may want to convert those values to liters per 100 kilometers (L/100km). To do so, divide 235 by each value in the city-mpg column.

* Convert "mpg" to "L/100km" in Car dataset.

|city-mpg|to|city-L/100km|
|--|--|--|
|21|$\rightarrow$|11.2|
|21|$\rightarrow$|11.2|
|19|$\rightarrow$|12.4|
|...|$\rightarrow$|...|

```python
df["city-mpg"] = 235 / df["city-mpg"]  # conversion: mpg to L/100km
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)  # rename the column to match the new unit
```

**Incorrect data types**

* Sometimes the wrong data type is assigned to a feature.
  
**Data types in Python and Pandas**

- object: "A", "Hello"
- int64: 1, 3, 5
- float64: 2.123

To identify data types:
* Use
  ```python
  dataframe.dtypes
  ```
  to list the data type of each column. Note that `dtypes` is an attribute, so it takes no parentheses.

To convert data types:
* Use `dataframe.astype()` to convert a data type.

Example: convert the "price" column to integer type.

```python
df["price"] = df["price"].astype("int")
```
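
To see the two steps together, here is a small sketch that inspects the types, converts, and inspects again. The sample prices are made up, and the column is assumed to have been read in as text.

```python
import pandas as pd

# Hypothetical prices that were read in as strings
df = pd.DataFrame({"price": ["13495", "16500", "13950"]})

# Inspect the column types before converting
print(df.dtypes)

# Convert the text values to integers
df["price"] = df["price"].astype("int")

# Inspect again: "price" now has an integer dtype
print(df.dtypes)
```

Note that `astype("int")` raises an error if the column still contains missing values or non-numeric text, so it is usually applied after the missing values have been handled.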

## Data Normalization in Python

Consider a dataset with the features ["length", "width", "height"]. The length feature ranges from 150 to 250, while the width and height features range from 50 to 100. We may want to normalize these variables so that their value ranges are consistent.

| |length|width|height|
|--|--|--|--|
|scale|[150,250]|[50,100]|[50,100]|
|impact|large|small|small|

Another example:

|age|income|
|--|--|
|20|100000|
|30|20000|
|40|500000|

**Not-normalized**

* "age" and "income" are in different ranges.
* They are hard to compare.
* "income" will influence the result more, for example in a linear regression, simply because its values are larger.

**Normalized**

|age|income|
|--|--|
|0.2|0.2|
|0.3|0.04|
|0.4|1|


* Similar value range.
* Similar intrinsic influence on the analytical model.


**Methods of normalizing data**

1. Simple feature scaling:

   Divides each value by the maximum value for that feature. For non-negative data, this makes the new values range between 0 and 1.
   $$x_{new}=\frac {x_{old}}{x_{max}}$$
2. Min-Max:

   Takes each value $x_{old}$, subtracts the minimum value of that feature from it, then divides by the range of that feature. Again, the resulting new values range between 0 and 1.

   $$ x_{new}=\frac {x_{old}-x_{min}}{x_{max}-x_{min}}$$
3. Z-score or data standardization:

   For each value, subtracts the mean $\mu$ of the feature and then divides by its standard deviation $\sigma$. In pandas, the `mean()` method returns the average value of the feature in the dataset, and the `std()` method returns its standard deviation. The resulting values typically range between -3 and +3, but can be higher or lower.

   $$x_{new}= \frac {x_{old}-\mu}{\sigma} $$

**Python**

Given a dataset:

|length|width|height|
|--|--|--|
|168.8|64.1|48.8|
|180.0|65.5|52.4|
|...|...|...|

Normalize the length feature:

1. Feature scaling:
   ```python
   df["length"]=df["length"]/df["length"].max()
   ```
2. Min-Max:
   ```python
   df["length"]=(df["length"]-df["length"].min())/ (df["length"].max()-df["length"].min())
   ```
3. Z-score:
   ```python
   df["length"]=(df["length"]-df["length"].mean())/ (df["length"].std())
   ```
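
Each of the three snippets above overwrites `df["length"]`, so they are alternatives rather than steps to run in sequence. To compare the methods side by side, one option is to write each result to its own column, as in this sketch. The sample values and the new column names (`length_scaled`, `length_minmax`, `length_zscore`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample values for the "length" feature
df = pd.DataFrame({"length": [150.0, 175.0, 200.0, 250.0]})
length = df["length"]

# 1. Simple feature scaling: divide by the maximum
df["length_scaled"] = length / length.max()

# 2. Min-max: shift by the minimum, then divide by the range
df["length_minmax"] = (length - length.min()) / (length.max() - length.min())

# 3. Z-score: subtract the mean, then divide by the standard deviation
df["length_zscore"] = (length - length.mean()) / length.std()

print(df)
```

Keep in mind that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, so the z-scores can differ slightly from those of tools that use the population standard deviation.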


## Binning in Python

* Binning groups a set of numerical values into a set of "bins".
* It converts numeric variables into categorical variables. For example, you can bin "age" into [0 to 5], [6 to 10], [11 to 15], and so on.
* Binning can sometimes improve the accuracy of a model.
* Example: "price" is a feature ranging from 5,000 to 45,000. To get a better representation of price, we categorize it into low, medium, and high price.

In Python, binning is easy to implement. We would like 3 bins of equal width, so we need 4 dividers that are an equal distance apart.
We use the NumPy function `linspace` to return the array `bins`, which contains 4 equally spaced numbers over the range of the price.

```python
import numpy as np
import pandas as pd
bins = np.linspace(min(df["price"]), max(df["price"]), 4)  # 4 equally spaced dividers
group_names = ["Low", "Medium", "High"]  # one label per bin
# cut() assigns each price to a bin; include_lowest=True keeps the minimum price in the first bin
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
```

Then use a histogram to see the distribution of the bins.
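
As a quick check before plotting, the number of entries per bin can also be counted directly. The sketch below repeats the binning on a small made-up price column so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical prices spanning the 5,000 to 45,000 range
df = pd.DataFrame({"price": [5000, 9000, 12000, 20000, 26000, 31000, 45000]})

# Three equal-width bins need four dividers
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

# Count how many prices fall into each bin
counts = df["price-binned"].value_counts(sort=False)
print(counts)
```

If matplotlib is installed, `counts.plot(kind="bar")` is one way to draw these counts as a bar chart.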


## Turning categorical variables into quantitative variables in Python

**Categorical Variables**

**Problem**:
* Most statistical models cannot take objects/strings as input.
* In the car dataset, the fuel type feature is a categorical variable with two values, gas or diesel, which are in string format.

- Convert these variables into a numeric format by adding a new 0/1 column for each category:
  |Car|Fuel|...|gas|diesel|
  |--|--|--|--|--|
  |A|gas|...|1|0|
  |B|diesel|...|0|1|
  |C|gas|...|1|0|
  |D|gas|...|1|0|
  
  This technique is called "one-hot encoding".

**Dummy Variables in Python pandas**

* Use the `pandas.get_dummies()` function.
* It converts categorical variables to dummy variables (0 or 1):
  ```python
  pd.get_dummies(df['fuel'])
  ```

  This splits the column into separate gas and diesel columns.
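
The sketch below shows one way to attach the dummy columns to the original DataFrame, using made-up rows that mirror the table above. Recent pandas versions return boolean columns by default, so `dtype=int` is passed here to get 0/1 values.

```python
import pandas as pd

# Hypothetical rows mirroring the fuel table
df = pd.DataFrame({
    "car": ["A", "B", "C", "D"],
    "fuel": ["gas", "diesel", "gas", "gas"],
})

# One 0/1 column per category; dtype=int avoids True/False output
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns next to the original data
df = pd.concat([df, dummies], axis=1)
print(df)
```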

