<a href="https://colab.research.google.com/github/PosgradoMNA/actividades-de-aprendizaje-SinaiAvalos/blob/main/Semana%205/Module2_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Sinaí Avalos Rivera   A01730466*

# **Module 2 - Data Wrangling**



**PRE-PROCESSING DATA IN PYTHON**

Data pre-processing is a necessary step in data
analysis. It is the process of converting or mapping data from one “raw” form into another
format to make it ready for further analysis.

Data normalization: Different columns of numerical data may have very different ranges, and direct comparison is often not meaningful. Normalization is a way to bring all data into a similar range, for more useful comparison. Techniques: centering/scaling.


Data binning: Binning creates bigger categories from a set of numerical values. It is particularly useful for comparison between groups of data.

Each column of a dataframe is a pandas Series.

**Add 1 to each "symboling" entry**:



```
# df['symboling'] = df["symboling"]+1
```
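As a small illustration of the line above, the following sketch shows that a column is a Series and that adding a scalar applies to every entry. The toy values are made up for the example and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real "symboling" column.
df = pd.DataFrame({"symboling": [3, 1, -2]})

# Selecting a single column returns a pandas Series.
column = df["symboling"]
print(type(column))

# Adding a scalar is applied element-wise to the whole Series.
df["symboling"] = df["symboling"] + 1
print(df["symboling"].tolist())
```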



**DEALING WITH MISSING VALUES IN PYTHON**

When no data value is stored for a feature for a particular observation, we say this
feature has a “missing value”.
Missing values usually appear in a dataset as “?”, “N/A”, 0, or just a blank cell.
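A placeholder such as “?” is not recognized by pandas as missing until it is converted to NaN. The sketch below shows one way to do that, assuming the placeholder is the string "?"; the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" marks a value that was never recorded.
df = pd.DataFrame({
    "normalized-losses": ["164", "?", "158"],
    "price": ["13950", "17450", "?"],
})

# Swap the "?" placeholder for NaN so pandas treats it as missing.
df = df.replace("?", np.nan)

# Count the missing values in each column.
missing_counts = df.isna().sum()
print(missing_counts)
```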

OPTIONS TO DEAL WITH MISSING VALUES:

* First, check whether the person or group that collected the data can go back and find what the actual value should be.
* Remove the data where the missing value is found: drop the whole variable, or drop only the data entry.
* Replace the data: replace it with an average (of similar data points), replace it by frequency (for example with the mode, the most common value), or replace it based on other functions.
* Leave it as missing data.

Missing values can be dropped or replaced in Python. To remove data that contains missing values, the pandas library has a built-in method called ‘dropna’:


```
# dataframes.dropna()
```

Essentially, with the dropna method, you can choose to drop rows or columns that contain
missing values, like NaN. Specify “axis=0” to drop the rows, or “axis=1” to drop the
columns that contain the missing values.



```
# df.dropna(subset=['price'], axis=0, inplace = True)
```

Setting the argument “inplace” to “True” applies the modification to the dataframe directly, instead of returning a modified copy.

Equivalent to:

```
# df = df.dropna(subset=['price'], axis=0)
```

The following line of code does not change the dataframe; it only returns the result, which is a good way to check that you are performing the correct operation. To modify the dataframe, set the parameter "inplace" to True or assign the result back to df.

```
# df.dropna(subset=['price'], axis=0)
```
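The difference between returning a result and modifying the dataframe can be checked on a small example. This is a minimal sketch with made-up values; the `reset_index(drop=True)` call is an optional extra, not part of the notes above, that renumbers the rows after dropping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price.
df = pd.DataFrame({
    "make": ["audi", "bmw", "dodge"],
    "price": [13950.0, np.nan, 6377.0],
})

# Without inplace=True the result is returned and df is left untouched.
preview = df.dropna(subset=["price"], axis=0)
print(len(df), len(preview))  # 3 2

# Assigning the result back keeps the change; reset_index renumbers the rows.
df = df.dropna(subset=["price"], axis=0).reset_index(drop=True)
print(len(df))  # 2
```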

To replace missing values like NaNs with actual values, pandas library has a built in method
called ‘replace’, which can be used to fill in the missing values with the newly
calculated values.

```
# dataframe.replace(missing_value, new_value)
```


```
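mean = df["normalized-losses"].mean()  # call the method; NaN values are skipped
# replace returns a new Series, so assign it back to the column
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)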
```
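A self-contained sketch of replacing missing values with the column mean follows. The numbers are hypothetical and chosen so the mean is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry.
df = pd.DataFrame({"normalized-losses": [164.0, np.nan, 158.0, 122.0]})

# mean() skips NaN by default: (164 + 158 + 122) / 3 = 148.0
mean = df["normalized-losses"].mean()

# replace returns a new Series, so the result is assigned back to the column.
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
print(df["normalized-losses"].tolist())
```

The pandas method `fillna(mean)` gives the same result here and is a common alternative when the only goal is to fill NaN values.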



**DATA FORMATTING IN PYTHON**

Data formatting means bringing data into a common standard of expression that allows
users to make meaningful comparisons.

Convert mpg to L/100km in the dataset:

```
df["city-mpg"] = 235/df["city-mpg"]
# columns expects a dict that maps the old name to the new name
df.rename(columns={"city-mpg":"city-L/100km"}, inplace=True)
```
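The conversion can be tried on a few made-up values. This sketch assumes the rounded factor 235 used above (the more precise factor for US gallons is about 235.215) and assigns the renamed dataframe back instead of using `inplace`.

```python
import pandas as pd

# Hypothetical fuel economy values in miles per gallon.
df = pd.DataFrame({"city-mpg": [21, 19, 24]})

# L/100km is approximately 235 divided by mpg.
df["city-mpg"] = 235 / df["city-mpg"]

# Rename the column so its name matches the new unit.
df = df.rename(columns={"city-mpg": "city-L/100km"})
print(df.columns.tolist())
```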

For a number of reasons, including when you import a dataset into Python, a data type
may be incorrectly established. It is important for later analysis to explore each feature’s data type and convert it
to the correct type; otherwise, the models developed later on may behave strangely, and totally valid data may end up being treated like missing data.


To identify a feature's data type, in Python we can use the dataframe.dtypes attribute (not a method, so no parentheses) and
check the data type of each variable in a dataframe.


```
# dataframe.dtypes
```


Convert data types:


```
# dataframe.astype()
```

```
# df["price"] = df["price"].astype("int")
```
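Checking and converting a data type can be sketched as follows, assuming a hypothetical "price" column that was read in as text.

```python
import pandas as pd

# Hypothetical toy data: prices imported as strings instead of numbers.
df = pd.DataFrame({"price": ["13950", "16500", "17450"]})

# dtypes is an attribute; it lists the type of every column.
print(df.dtypes)

# astype returns a converted copy, so assign it back to the column.
df["price"] = df["price"].astype("int")
print(df["price"].dtype)

# Numeric operations now work as expected.
total = df["price"].sum()
print(total)
```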




**DATA NORMALIZATION IN PYTHON**

We may want to normalize variables so that the range of their values is consistent.
This normalization can make some statistical analyses easier down the road.
By making the ranges consistent between variables, normalization enables a fairer comparison
between the different features and makes sure they have the same impact. It is also important for computational reasons.


When two variables have very different ranges, the one with the larger values can dominate the result. To avoid this, we can normalize both variables into values that range from 0 to 1.

After normalization, both variables now have a similar influence on the models we will
build later.

Ways to normalize data: 

* The first method, called “simple feature scaling”, just divides each value by the
maximum value for that feature.
This makes the new values range between 0 and 1.

```
# df["length"] = df["length"]/df["length"].max()
```

* The second method, called “Min-Max”, takes each value, X_old, subtracts the minimum
value of that feature from it, then divides by the range of that feature.
Again, the resulting new values range between 0 and 1.

```
# df["length"] = (df["length"]-df["length"].min())/(df["length"].max()-df["length"].min())
```

* The third method is called “z-score” or “standard score”.
For each value, you subtract mu, the average of the feature,
and then divide by the standard deviation (sigma).
The resulting values hover around 0, and typically range between -3 and +3, but can be higher
or lower.

```
# df["length"] = (df["length"]-df["length"].mean())/df["length"].std()
```
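The three methods above can be compared side by side on a small example. This is a sketch with made-up lengths; the new column names are hypothetical and only there to keep the results apart. Note that the pandas `std()` method computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical toy data for the "length" feature.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
x = df["length"]

# Simple feature scaling: divide by the maximum.
df["length_simple"] = x / x.max()

# Min-Max: subtract the minimum, then divide by the range.
df["length_minmax"] = (x - x.min()) / (x.max() - x.min())

# Z-score: subtract the mean, then divide by the standard deviation.
df["length_zscore"] = (x - x.mean()) / x.std()

print(df)
```

For these values, simple feature scaling gives 0.75, 0.875 and 1.0, Min-Max gives 0.0, 0.5 and 1.0, and the z-score gives -1.0, 0.0 and 1.0.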



**BINNING IN PYTHON**

Binning is when you group values together into bins. For example, you can bin “age”
into [0 to 5], [6 to 10], [11 to 15] and so on.

In addition, sometimes we use data binning to group a set of numerical values into a
smaller number of bins to have a better understanding of the data distribution.
With binning, we turn a numerical variable into a categorical one.



```
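# 4 equally spaced edges define 3 bins
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum price inside the first bin
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)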
```

We use the pandas function ”cut” to segment and sort the data values into bins.
You can then use histograms to visualize the distribution of the data after they’ve been
divided into bins.
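A self-contained version of the binning step might look like this. The prices are made up for the example; three bins need four edges, which is why `np.linspace` is asked for 4 values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices.
df = pd.DataFrame({"price": [5000, 9000, 14000, 21000, 30000, 45000]})

# Four equally spaced edges between the minimum and maximum give three bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True makes the first bin include the minimum price.
df["price-binned"] = pd.cut(
    df["price"], bins, labels=group_names, include_lowest=True
)

# Count how many prices fall in each bin.
print(df["price-binned"].value_counts())
```

Here the edges are 5000, about 18333, about 31667 and 45000, so three prices land in "Low", two in "Medium" and one in "High".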


**TURNING CATEGORICAL VARIABLES INTO QUANTITATIVE VARIABLES IN PYTHON**

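Most statistical models cannot take strings as input, so categorical variables need a numeric encoding.

Solution:
* Add a dummy variable for each unique category
* Assign 0 or 1 to each row for each category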
* In Pandas: 
```
# pd.get_dummies(df['fuel'])
```
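The following sketch shows what the dummy variables look like for a made-up "fuel" column. The `dtype=int` argument is an addition for this example so the output shows 0 and 1; without it, recent pandas versions return True and False.

```python
import pandas as pd

# Hypothetical toy data with two fuel categories.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One new column per unique category, holding 1 where the row matches.
dummies = pd.get_dummies(df["fuel"], dtype=int)
print(dummies)

# Attach the dummy columns to the original dataframe.
df = pd.concat([df, dummies], axis=1)
print(df.columns.tolist())
```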
