<a href="https://colab.research.google.com/github/PosgradoMNA/actividades-de-aprendizaje-FranciscoMedellin/blob/main/Semana_5_Modulo_02_Notas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 02 Data Analysis with Python
- Francisco Medellin Zertuche 
- A01794044

# Pre-Processing Data in Python

Data pre-processing is also often called “data cleaning” or “data wrangling”, and there
are likely other terms.

Objectives of this module:<br>
- Identify and handle missing values. A “missing value” condition occurs whenever a data entry is left empty.
- Format data. Data from different sources may arrive in different formats, units, or conventions; we introduce some pandas methods that standardize the values into the same format, unit, or convention.
- Normalize data. Normalization is a way to bring all data into a similar range for more useful comparison. Specifically, we focus on the techniques of centering and scaling.
- Bin data. Binning creates bigger categories from a set of numerical values and is particularly useful for comparison between groups of data.
- Handle categorical variables: convert categorical values into numeric variables to make statistical modeling easier.

# Dealing with missing values in Python

When no data value is stored for a feature in a particular observation, we say this
feature has a “missing value”.
A missing value usually appears in a dataset as “?”, “N/A”, 0, or just a blank cell.<br>
But how can you deal with missing data?<br>

Of course, each situation is different and should be judged differently.
However, these are the typical options you can consider:
- Check whether the person or group that collected the data can go back and find what the actual value should be.

- Remove the data where the missing value is found.
When you drop data, you can either drop the whole variable or just the single data entry with the missing value.

  - Drop the variable.
  - Drop the data entry.

If you are removing data, look for the option that has the least impact on the dataset. <br>
Replacing data is often preferable, since no data is wasted, although the replacement is only an estimate of the true value.<br>

- Replace the missing values.
  - Replace them with an average (of similar data points).

  - Replace them by frequency. For categorical variables, where an average cannot be computed, use the most common value (the mode) instead.
  - Replace them based on other functions.

- Leave the missing data as missing.


How to drop missing values?
```
df.dropna()
```
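You can drop either the rows or the columns that contain missing values.
- axis=0 drops the entire row.
- axis=1 drops the entire column.

To drop rows based on missing values in a specific column, use *subset*.<br>
The parameter *inplace=True* applies the modification to the dataset directly, so the two statements below are equivalent.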
```
df.dropna(subset = ["columnName"], axis=0, inplace = True)

df = df.dropna(subset = ["columnName"], axis=0)
``` 
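
As a minimal sketch of the difference between these options, the example below builds a small hypothetical DataFrame (the column names and values are invented for illustration) and drops rows and columns in turn.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "make": ["audi", "bmw", "dodge", "honda"],
    "price": [13495.0, np.nan, 16500.0, 13950.0],
    "horsepower": [111.0, 154.0, np.nan, 102.0],
})

# Drop every row that has a missing value in any column.
rows_any = df.dropna(axis=0)

# Drop only the rows where "price" is missing; gaps elsewhere are kept.
rows_price = df.dropna(subset=["price"], axis=0)

# Drop every column that contains at least one missing value.
cols_any = df.dropna(axis=1)

print(len(rows_any), len(rows_price), list(cols_any.columns))
```

Dropping by *subset* keeps more rows than dropping on any missing value, which is one way to limit the impact of removing data.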

How to replace missing values?
```
df.replace(missing_value, new_value)
```

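Example: replace missing values with the mean
```
import numpy as np
mean = df["columnName"].mean()
# replace() returns a new Series, so assign it back to the column
df["columnName"] = df["columnName"].replace(np.nan, mean)
```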
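
The sketch below illustrates replacing by average and by frequency on a small hypothetical DataFrame (column names and values are invented). It uses `fillna()`, a pandas method that fills missing entries directly, as an alternative to `replace(np.nan, ...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "losses": [164.0, np.nan, 158.0, 122.0],     # numeric feature
    "doors": ["four", "two", np.nan, "four"],    # categorical feature
})

# Numeric column: fill the gap with the mean (NaN is skipped by mean()).
mean_losses = df["losses"].mean()
df["losses"] = df["losses"].fillna(mean_losses)

# Categorical column: an average is undefined, so use the most frequent value.
most_common = df["doors"].mode()[0]
df["doors"] = df["doors"].fillna(most_common)

print(df)
```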

# Data Formatting in Python

- Data is usually collected from different places, by different people, and may be stored in different formats.
- Data formatting means bringing data into a common standard of expression that allows
users to make meaningful comparisons.

As a part of dataset cleaning, data formatting ensures that data is consistent and easily
understandable.

Example: standardizing the different spellings of New York City.<br>
**Non-Formatted:**<br>
City:
- NY
- New York
- N.Y
- N.Y

**Formatted:**<br>
City:
- New York
- New York
- New York
- New York
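
One possible way to apply this formatting in pandas is to map each variant spelling to the standard one. The sketch below assumes a hypothetical column named `city` and a hand-written mapping; a real dataset would need its own list of variants.

```python
import pandas as pd

# The non-formatted values from the example above, in a hypothetical "city" column.
df = pd.DataFrame({"city": ["NY", "New York", "N.Y", "N.Y"]})

# Hand-written mapping from each variant to the standard spelling (an assumption).
city_map = {"NY": "New York", "N.Y": "New York"}

# replace() with a dict swaps exact matches and leaves other values untouched.
df["city"] = df["city"].replace(city_map)

print(df["city"].unique())
```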


**Incorrect Data Types**<br>
- Sometimes the wrong data type is assigned to a feature.
It is important to check during the analysis that each column has the correct data type.<br>

There are many data types in pandas.<br>

To identify data types:
```
df.dtypes  # dtypes is an attribute, not a method: no parentheses
```

To convert data types:
```
# astype() converts a column to the given data type
df["price"] = df["price"].astype("int")
```
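
As a short illustration, the sketch below assumes a hypothetical `price` column that was read in as text, inspects its type, and converts it to integers so that arithmetic works as expected.

```python
import pandas as pd

# Hypothetical data: prices stored as strings, as often happens after reading a file.
df = pd.DataFrame({"price": ["13495", "16500", "13950"]})

# Inspect the types; "price" is reported as a text/object type here.
print(df.dtypes)

# Convert the column to integers and check the result.
df["price"] = df["price"].astype("int")
print(df.dtypes)

# Numeric operations now behave as expected.
total = df["price"].sum()
```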

# Data Normalization

When we take a look at the used car data set, we notice in the data that the feature “length”
ranges from 150 to 250, while feature “width” and “height” ranges from 50 to 100.
We may want to normalize these variables so that the range of the values is consistent.
This normalization can make some statistical analyses easier down the road.
By making the ranges consistent between variables, normalization enables a fairer comparison
between the different features.

Consider a dataset containing two features: “age” and “income”, where “age”
ranges from 0 to 100, while “income” ranges from about 20,000 to 500,000.
“income” is therefore about 1,000 times larger than “age”,
so these two features are in very different ranges.
When we do further analysis, like linear regression, for example, the attribute “income” will
intrinsically influence the result more, due to its larger value, but this doesn’t necessarily
mean it is more ‘important’ as a predictor.

To avoid this, we can normalize these two variables into values that range from 0 to 1.
Compare the two tables at the right. <BR>
After normalization, both variables now have a similar influence on the models we will
build later.

**Simple feature scaling** <BR> 
This method just divides each value by the maximum value for that feature.
For non-negative data, this makes the new values range between 0 and 1.<BR> 

$$x_{new} = \frac{x_{old}}{x_{max}}$$



```
# Example in Pandas
df["column"] = df["column"]/df["column"].max()
```

**Min-Max** <br> 
This method takes each value, x_old, subtracts the minimum
value of that feature, then divides by the range of that feature.
Again, the resulting new values range between 0 and 1.<br>

$$x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}$$

```
# Example in Pandas
df["column"] = (df["column"]-df["column"].min()) / (df["column"].max()-df["column"].min())
```

**z-score or standard score**<br>
In this formula, for each value, you subtract μ (mu), which is the average of the feature, and then divide by the standard deviation σ (sigma).<br>
The resulting values hover around 0 and typically range between -3 and +3, but can be higher or lower.<br>

$$x_{new} = \frac{x_{old} - \mu}{\sigma}$$

- μ (mu): mean of the feature.
- σ (sigma): standard deviation of the feature.

```
# Example in Pandas
df["column"] = (df["column"]-df["column"].mean()) / (df["column"].std())
```
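
To compare the three methods side by side, the sketch below applies each of them to the same hypothetical `length` column (the values are invented, chosen within the 150 to 250 range mentioned earlier). Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default, so the z-scores may differ slightly from those of libraries that use the population standard deviation.

```python
import pandas as pd

# Hypothetical values for a "length" feature.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0, 250.0]})
col = df["length"]

# Simple feature scaling: divide by the maximum.
df["length_simple"] = col / col.max()

# Min-max: subtract the minimum, divide by the range.
df["length_minmax"] = (col - col.min()) / (col.max() - col.min())

# Z-score: subtract the mean, divide by the (sample) standard deviation.
df["length_zscore"] = (col - col.mean()) / col.std()

print(df.round(3))
```

Simple scaling keeps the smallest value above 0 (here 150/250 = 0.6), while min-max stretches the data over the full 0 to 1 interval.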

# Data Binning

Binning is when you group values together into bins. For example, you can bin “age”
into [0 to 5], [6 to 10], [11 to 15] and so on.
Sometimes, binning can improve the accuracy of predictive models.

In addition, we sometimes use data binning to group a set of numerical values into a
smaller number of bins to get a better understanding of the data distribution.

For example, “price” here is an attribute that ranges from 5,000 to 45,500.<br>
Using binning, we categorize the price into three bins: low price, medium price, and high
price.

- Binning: grouping of values into bins.
- Converts numeric variables into categorical variables.
- Groups a set of numerical values into a set of bins.
- "price" is a feature that ranges from 5,000 to 45,500. To get a better representation of price, we can categorize it into three bins: low, mid, and high.

**Bins**
- Low (5,000, 10,000, 12,000)
- Mid (30,000, 31,000)
- High (39,000, 44,000, 44,500)

We use the numpy function “linspace” to return the array “bins” that contains
4 equally spaced numbers over the specified interval of the price.
Four numbers are needed because three bins require four dividers (edges).


```
bins = np.linspace(min(df["price"]),max(df["price"]), 4)
```
We create a list “group_names” that contains the different bin names.


```
group_names = ["low", "mid", "high"]
```

We use the pandas function “cut” to segment and sort the data values into bins.
The parameter *include_lowest=True* makes the first bin include the minimum price.
```
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
```
Then use histograms to visualize the distribution of the data after they’ve been
divided into bins.
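
The binning steps above can be put together as in the sketch below. It reuses the example prices listed under **Bins** as hypothetical data and counts the entries per bin, which are the heights a histogram of the binned data would display.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, taken from the example bins listed earlier.
df = pd.DataFrame(
    {"price": [5000, 10000, 12000, 30000, 31000, 39000, 44000, 44500]}
)

# Four equally spaced edges define three bins of equal width.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["low", "mid", "high"]

# include_lowest=True keeps the minimum price inside the first bin.
df["price-binned"] = pd.cut(
    df["price"], bins, labels=group_names, include_lowest=True
)

# Count entries per bin, kept in bin order rather than sorted by count.
counts = df["price-binned"].value_counts(sort=False)
print(counts)
```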

# Turning categorical variables into quantitative variables

Most statistical models cannot take objects or strings as input and, for model training,
only accept numbers as inputs.

In the car dataset, the "fuel-type" feature is a categorical variable with two values,
"gas" or "diesel", which are in string format.

We encode the values by adding new features corresponding to each unique element in the
original feature we would like to encode.

**Solution**<br>
- Add dummy variables for each unique category.
- Assign 0 or 1 in each category. Example: gas(1), diesel(0).

This technique is often called "one-hot encoding".

In pandas, we can use the get_dummies() method to convert categorical variables to dummy
variables (0 or 1).<br><br>
The get_dummies() method automatically generates one new column per category, where each
value indicates whether the row belongs to that category.
```
pd.get_dummies(df["fuel"])
```
| fuel   | gas | diesel |
| ------ | --- | ------ |
| gas    | 1   | 0      |
| gas    | 1   | 0      |
| diesel | 0   | 1      |
| diesel | 0   | 1      |
| gas    | 1   | 0      |
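
A minimal sketch of this encoding, using the five hypothetical `fuel` values from the table above, is shown below. Two details are worth noting: pandas orders the new columns alphabetically (`diesel` before `gas`), and recent pandas versions return boolean columns by default, so `dtype=int` is passed here to get 0 and 1.

```python
import pandas as pd

# The fuel values from the table above, in a hypothetical "fuel" column.
df = pd.DataFrame({"fuel": ["gas", "gas", "diesel", "diesel", "gas"]})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns next to the original feature.
df = pd.concat([df, dummies], axis=1)

print(df)
```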



