# Data Wrangling - More Notes

## Missing Values

### Dropping Missing Values
To drop missing values, use
`dataframe.dropna()`
This returns a new dataframe and leaves the original unchanged, unless the parameter `inplace=True` is used.

The argument `axis=0` drops every row that contains a missing value, and `axis=1` drops every column that contains one.
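As a minimal sketch of the two `axis` options, the example below uses a small made-up dataframe (the column names and values are assumptions for illustration only). It also shows the `subset` parameter, which limits the check to particular columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing price and one missing colour
df = pd.DataFrame({
    "price": [5000.0, np.nan, 12000.0],
    "colour": ["red", "blue", None],
    "doors": [2, 4, 4],
})

rows_dropped = df.dropna(axis=0)            # keep only rows with no missing values
cols_dropped = df.dropna(axis=1)            # keep only columns with no missing values
price_known = df.dropna(subset=["price"])   # only look at 'price' when deciding

# df itself is unchanged because inplace=True was not used
print(len(df), len(rows_dropped), list(cols_dropped.columns))
```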


### Replacing Missing Values

To replace missing values, use `dataframe.replace(missing_value, new_value)`

# ![Replacing-Missing-Values](../Images/Replacing-Missing-Values.png)
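One common choice of `new_value` is the column mean. The sketch below assumes a hypothetical 'price' column where missing entries are stored as `np.nan`; other datasets may mark missing values differently (e.g. with `'?'`), in which case that marker would be passed as `missing_value` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price
df = pd.DataFrame({"price": [5000.0, np.nan, 13000.0]})

# .mean() skips NaN by default, so this is the mean of the known prices
mean_price = df["price"].mean()

# Replace every NaN in the column with the mean
df["price"] = df["price"].replace(np.nan, mean_price)
print(df)
```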

## Correcting Data Types

### Identifying Data Types
Use `dataframe.dtypes` to identify the data type of each column. Note that `dtypes` is an attribute, not a method, so it takes no parentheses.

### Converting Data Types
Use `dataframe.astype()` to convert a data type, e.g. converting the column 'price' to integer would be `df['price'] = df['price'].astype('int')`
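A short sketch of checking and then converting a type, assuming a hypothetical 'price' column that was read in as text:

```python
import pandas as pd

# Hypothetical data: prices stored as strings rather than numbers
df = pd.DataFrame({"price": ["5000", "12000", "30000"]})

print(df.dtypes)  # 'price' is reported as a text (object/string) type

# Convert the column to integers so arithmetic works as expected
df["price"] = df["price"].astype("int")

print(df.dtypes)  # 'price' is now an integer type
```

If the column contains values that cannot be parsed as integers, `astype('int')` raises an error, so missing or malformed values need to be handled first.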

## Normalising Data
The following are ways to normalise data, and each type has its own use case.

### Simple Feature Scaling
`df['column'] = df['column']/df['column'].max()`
# ![Simple-Feature-Scaling](../Images/Simple-Feature-Scaling.png)

### Min-Max
`df['column'] = (df['column'] - df['column'].min())/(df['column'].max()-df['column'].min())`
# ![Min-Max](../Images/Min-Max.png)

### Z-Score
`df['column'] = (df['column']-df['column'].mean())/df['column'].std()`
# ![Z-Score](../Images/Z-Score.png)
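To compare the three approaches side by side, the sketch below applies each formula to the same made-up 'length' column (the column name and values are assumptions). Simple feature scaling and min-max both give values up to 1, while z-scores are centred on 0.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
col = df["length"]

# Simple feature scaling: divide by the maximum
df["simple"] = col / col.max()

# Min-max: shift by the minimum, then divide by the range
df["min_max"] = (col - col.min()) / (col.max() - col.min())

# Z-score: subtract the mean, divide by the (sample) standard deviation
df["z_score"] = (col - col.mean()) / col.std()

print(df)
```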


## Binning Data

Binning is the grouping of values into "bins". It converts numeric data into categorical variables. For example, if item prices vary between £5,000 and £50,000, they can be placed into "low", "medium", and "high" priced items by defining the bin ranges.

To group into `N` bins of equal width, we need `N+1` equally spaced numbers as dividers. To get them, we can use the numpy function `linspace` to return the array `bins`, which contains `N+1` equally spaced numbers over the range of the price. We can then create a list of bin names and use the pandas function `cut` to assign each data value to a bin.

So, for the low, medium, high example:

`bins = np.linspace(min(df['price']), max(df['price']), 4)`

`group_names = ['Low', 'Medium', 'High']`

`df['price_binned'] = pd.cut(df['price'], bins, labels=group_names, include_lowest=True)`

# ![Binning](../Images/Binning.png)

Note that histograms are a good visualization for data split into bins.
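The binning steps above can be put together into a runnable sketch. The prices here are made up for illustration; with a minimum of 5,000 and a maximum of 50,000, `linspace` gives the dividers 5,000, 20,000, 35,000 and 50,000.

```python
import numpy as np
import pandas as pd

# Hypothetical prices
df = pd.DataFrame({"price": [5000, 12000, 27500, 41000, 50000]})

# 3 bins need 4 equally spaced dividers
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True makes sure the minimum price falls into the first bin
df["price_binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

print(df)
print(df["price_binned"].value_counts())
```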

## One-Hot Encoding

One-hot encoding allows us to go from categorical variables to numerical variables, by adding a dummy variable for each unique category and assigning a 0 or 1 to it in each row.

To do this with Pandas, use the `pandas.get_dummies()` function. For example, for a column 'fuel' with values of 'petrol' and 'diesel', the following code will create 2 columns called 'petrol' and 'diesel', and assign 0s and 1s in each row depending on which value appeared in the 'fuel' column.

`pd.get_dummies(df['fuel'])`

# ![One-Hot Encoding](../Images/One-Hot-Encoding.png)
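A small sketch of the 'fuel' example on made-up rows. Passing `dtype=int` is a choice made here so the dummy columns hold 0s and 1s explicitly (depending on the pandas version, the default may be booleans), and `pd.concat` is one way to attach the new columns to the original dataframe.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"fuel": ["petrol", "diesel", "petrol"]})

# One new column per unique category, holding 0 or 1
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns alongside the original column
df = pd.concat([df, dummies], axis=1)

print(df)
```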

## Grouping Data

### groupby method
The Pandas `dataframe.groupby()` method can be applied on categorical variables, and groups the data into categories. An aggregate function, e.g. `.mean()` can also be applied.
For example:

`df_test = df[['drive_wheels', 'body_style', 'price']]`

`df_grp = df_test.groupby(['drive_wheels', 'body_style'], as_index = False).mean()`

# ![Groupby](../Images/Groupby.png)

Note here that all the 4wd cars are grouped first, then all the fwd cars, then all the rwd cars, because `groupby` sorts the group keys by default (here, alphabetically).


It can be applied to single or multiple variables. It can be useful in certain scenarios, but is often hard to read, so pivot tables are often used instead.
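The grouping above can be tried on a few made-up rows (the values are assumptions for illustration). Each unique combination of 'drive_wheels' and 'body_style' becomes one row holding the mean price.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "drive_wheels": ["fwd", "rwd", "fwd", "rwd", "4wd"],
    "body_style": ["sedan", "sedan", "sedan", "hatchback", "sedan"],
    "price": [10000.0, 20000.0, 14000.0, 16000.0, 9000.0],
})

# Keep only the columns of interest, then group and average
df_test = df[["drive_wheels", "body_style", "price"]]
df_grp = df_test.groupby(["drive_wheels", "body_style"], as_index=False).mean()

print(df_grp)
```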

### pivot method
The Pandas `pivot` method reshapes a table so that one variable is displayed along the columns and another along the rows. Each index/column pair must be unique, so it is applied to the grouped dataframe. The syntax is as follows:

`df_pivot = df_grp.pivot(index = 'drive_wheels', columns = 'body_style')`

# ![Pivot](../Images/Pivot.png)
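A minimal sketch of pivoting already-grouped data, using made-up values. The `values='price'` argument is an addition here that keeps the column labels flat; combinations that do not occur in the data (e.g. a fwd hatchback below) come out as `NaN`.

```python
import pandas as pd

# Hypothetical grouped data: one row per (drive_wheels, body_style) pair
df_grp = pd.DataFrame({
    "drive_wheels": ["4wd", "fwd", "rwd", "rwd"],
    "body_style": ["sedan", "sedan", "hatchback", "sedan"],
    "price": [9000.0, 12000.0, 16000.0, 20000.0],
})

# Rows are drive_wheels, columns are body_style, cells are prices
df_pivot = df_grp.pivot(index="drive_wheels", columns="body_style", values="price")

print(df_pivot)
```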

A useful way to visualise this is by using a heatmap, so using matplotlib:

`import matplotlib.pyplot as plt`

`plt.pcolor(df_pivot, cmap = 'RdBu')`

`plt.colorbar()`

`plt.show()`

# ![Heatmap](../Images/Heatmap.png)


