# Data Wrangling

Data wrangling is an essential initial phase in data analysis, akin to preparing raw materials before manufacturing. It encompasses tasks such as cleaning, organizing, and transforming data into a structured format suitable for analysis. **Also referred to as data cleaning or data preprocessing**, this process ensures that the data is accurate, complete, and formatted appropriately for subsequent analysis. In essence, it lays the foundation for deriving meaningful insights from the dataset.


# Contents

1. Identifying and Handling Missing Values
2. Data Formatting
3. Data Normalization
4. Data Binning
5. Turning Categorical Values to Numerical Values

# 1. Identifying and Handling Missing Values

A missing value in a dataset can occur as **?, NA, 0 or a blank cell.**
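
Before missing values can be handled they have to be found. The following is a minimal sketch, assuming a small made-up table in which `?` and empty strings mark missing entries; the column names and values are hypothetical. Whether `0` counts as missing depends on the dataset, so it is not converted here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "?" and "" stand in for missing entries.
raw = pd.DataFrame({
    "price": ["13495", "?", "16500", ""],
    "fuel": ["gas", "diesel", None, "gas"],
})

# Convert the placeholder markers to NaN so pandas recognizes them as missing.
clean = raw.replace(["?", ""], np.nan)

# isnull() flags each missing cell; sum() counts them per column.
missing_counts = clean.isnull().sum()
print(missing_counts)
```

Counting missing values per column first makes it easier to decide which of the strategies to apply to each variable.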

There are several ways to handle missing data.
1. Check with the data collection source

2. Drop the missing value
- Drop the variable
- Drop the data entry
- Be careful that dropping does not remove so much data that it skews the results

3. Replace the missing value
- Replace missing values with the average value of the entire variable
- Replace with the most frequent value of the variable

4. Leave it as missing data

## dropna() method

NaN (Not a Number) values in Pandas represent **missing or undefined values** in the data. Left unhandled, they can affect the accuracy and reliability of your data analysis, so dropping them is often a sensible step.

The <code>dropna()</code> method in pandas is used to remove missing values (NaN, null values) from a DataFrame or Series object. It provides flexibility in terms of which **axis (rows or columns)** to consider for dropping, as well as the **threshold for the number of missing values required to trigger dropping.**

**Axis** specifies whether to drop rows <code>(axis=0)</code> or columns <code>(axis=1)</code> that contain missing values. **By default, it's set to 0, meaning it drops rows.**

Setting the argument <code>inplace</code> to <code>True</code> applies the modification directly to the data frame instead of returning a modified copy. With <code>inplace = True</code> the method returns <code>None</code>, so its result should not be assigned to a variable.

In [None]:
# drop rows with a missing value in column "x" as follows
# df1.dropna(subset = ["x"], axis = 0, inplace = True)
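
The behaviour of <code>dropna()</code> can be checked on a small example. This is only a sketch: the data frame `df1` and its columns `x` and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with missing values in both columns.
df1 = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, 2.0, 3.0, 4.0],
})

# Drop rows where "x" is missing; this returns a new frame and leaves df1 unchanged.
dropped_x = df1.dropna(subset=["x"], axis=0)

# Drop every row that contains at least one missing value.
dropped_any = df1.dropna()

# thresh=1 keeps rows that have at least one non-missing value.
kept_thresh = df1.dropna(thresh=1)

# inplace=True modifies df1 itself and returns None.
result = df1.dropna(subset=["x"], inplace=True)

print(len(dropped_x), len(dropped_any), len(kept_thresh), result, len(df1))
```

Because the in-place call returns `None`, assigning its result to the data frame's own name would overwrite the data.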

## replace() method

Pandas, a Python library for handling data, includes a useful method called <code>replace()</code>. This method can fill in missing values (like NaNs) in your dataset with new values.

A missing value can be replaced by the mean of its column. The mean is calculated with the <code>mean()</code> method as the average of the entries within that specific column.

In [None]:
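# missing_value is the marker used for missing entries, e.g. np.nan
# new_value = df['column_name'].mean()
# df['column_name'] = df['column_name'].replace(missing_value, new_value)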
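
As a rough illustration of both replacement strategies (mean for a numerical column, most frequent value for a categorical one), here is a sketch on invented data. The column names `losses` and `doors` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "losses": [100.0, np.nan, 140.0, 120.0],
    "doors": ["four", "two", np.nan, "four"],
})

# mean() skips NaN by default: (100 + 140 + 120) / 3 = 120.0
mean_losses = df["losses"].mean()
df["losses"] = df["losses"].replace(np.nan, mean_losses)

# For a categorical column, use the most frequent value instead.
most_common = df["doors"].value_counts().idxmax()
df["doors"] = df["doors"].replace(np.nan, most_common)

print(df)
```

Pandas also provides <code>fillna()</code>, which fills NaN values directly and can be used in place of <code>replace()</code> here.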

# 2. Data Formatting

It is sometimes unavoidable that data ends up in different formats, because it is collected from different places by different people.

This is where data formatting comes in. **Data formatting** is like putting all your information into a common language everyone can understand. It's a step in cleaning up your dataset where you make sure everything looks the same and makes sense. This consistency helps people compare and analyze the data without any confusion.

## Incorrect Data Types
An incorrect data type is a wrong data type assigned to a feature. For the later stages of analysis, it is crucial to examine the data types of features and ensure they are converted to the appropriate types. Failure to do so could result in unexpected behavior of the developed models, potentially treating perfectly valid data as if it were missing.

In [None]:
# Identify the data type of each column (dtypes is an attribute, not a method)
#df.dtypes

# Convert the data type of column 'amount' to integer with astype()
#df['amount'] = df['amount'].astype("int")
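
To see the effect of a type conversion, consider this small sketch. The data frame and its `amount` and `item` columns are invented for illustration; the numbers are stored as text, as often happens after reading a file.

```python
import pandas as pd

# Hypothetical data: 'amount' holds numbers stored as strings.
df = pd.DataFrame({"amount": ["10", "25", "7"], "item": ["a", "b", "c"]})

# Inspect the data types before conversion.
print(df.dtypes)

# Convert 'amount' to an integer type so arithmetic works as expected.
df["amount"] = df["amount"].astype("int")
total = df["amount"].sum()

print(df.dtypes)
print(total)
```

Note that <code>astype("int")</code> raises an error if any entry cannot be parsed as an integer, so missing or malformed values should be handled first.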

# 3. Data Normalization

**Data normalization** is a way to make sure all the numbers in a dataset are in a similar range. It helps us compare different pieces of data more easily. We adjust the numbers so they're not too big or too small. This makes it simpler to understand and analyze the information.

This is an example of normalized data:

![image.png](attachment:image.png)
Figure 1

Another sample showing normalization.

**Before Normalization**
![image-2.png](attachment:image-2.png)
Figure 2.1

**After Normalization** (scaled between 0 and 1)
![image-3.png](attachment:image-3.png)
Figure 2.2


## Methods of Normalizing Data

![image-4.png](attachment:image-4.png)
Figure 3

![image-5.png](attachment:image-5.png)
Figure 4
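
For reference, the three normalization methods covered in the following subsections are commonly written as shown here, where $x_{old}$ is an original value and $x_{new}$ its normalized counterpart. The notation is an assumption chosen for this note and may differ from the figures.

$$x_{new} = \frac{x_{old}}{x_{max}} \qquad \text{(simple feature scaling)}$$

$$x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}} \qquad \text{(min-max)}$$

$$x_{new} = \frac{x_{old} - \mu}{\sigma} \qquad \text{(z-score)}$$

Here $x_{min}$ and $x_{max}$ are the smallest and largest values of the feature, $\mu$ is its mean, and $\sigma$ is its standard deviation. Min-max results always lie between 0 and 1, while z-scores are centered on 0 and can be negative.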



### Simple Feature Scaling in Pandas

The table in Figure 1 can be normalized in one line of code using simple feature scaling in Pandas.

![image.png](attachment:image.png)
Figure 5

In [None]:
# Simple Feature Scaling in Pandas
# df['length'] = df['length']/df['length'].max()

### Min-Max in Pandas

The table in Figure 1 can be normalized in one line of code using min-max in Pandas.
![image.png](attachment:image.png)
Figure 6

In [None]:
# min-max version 1 for the length column
# df['Length_MinMax'] = (df['length'] - df['length'].min()) / (df['length'].max() - df['length'].min())

# min-max version 2 for the length column
# min_value = df['length'].min()
# max_value = df['length'].max()
# df['Length_MinMax'] = (df['length'] - min_value) / (max_value - min_value)

### Z-score in Pandas

The table in Figure 1 can be normalized in one line of code using z-score in Pandas.

![image.png](attachment:image.png)
Figure 7

In [None]:
# Z-score
#df['Length_ZScore'] = (df['length'] - df['length'].mean())/df['length'].std()
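
The three methods can be compared side by side on a tiny example. This is a sketch with three made-up `length` values, chosen so the results are easy to verify by hand; it does not use the data from the figures.

```python
import pandas as pd

# Hypothetical lengths chosen for easy arithmetic.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})

# Simple feature scaling: divide by the maximum.
df["length_simple"] = df["length"] / df["length"].max()

# Min-max: shift by the minimum, then divide by the range.
length_range = df["length"].max() - df["length"].min()
df["length_minmax"] = (df["length"] - df["length"].min()) / length_range

# Z-score: subtract the mean, divide by the standard deviation.
# pandas std() uses the sample standard deviation (ddof=1) by default.
df["length_zscore"] = (df["length"] - df["length"].mean()) / df["length"].std()

print(df)
```

With these values, simple feature scaling gives 0.75, 0.875 and 1.0, min-max gives 0.0, 0.5 and 1.0, and the z-scores are -1.0, 0.0 and 1.0 (mean 175, sample standard deviation 25).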

# 4. Data Binning

Binning means grouping similar values together. For instance, you might group ages into 0-5 years old, 6-10 years old, and 11-15 years old. Grouping values this way helps you see patterns and trends in your data more clearly, and when predicting numerical values it can sometimes make the predictions more accurate.

Here is an example of what binning looks like. Assume we have the following data for car prices:
![image.png](attachment:image.png)
Figure 8

We want to bin these prices into three bins: Low, Medium, and High.
![image-2.png](attachment:image-2.png)
Figure 9

In [None]:
# bins = np.linspace(min(df['price']),max(df['price']),4)
# group_names = ['Low', 'Medium', 'High']
# df['price_binned'] = pd.cut(df['price'], bins, labels = group_names, include_lowest = True)
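
A runnable version of the same idea is sketched here. The prices are invented for illustration and are not the values from the figures; the three bins have equal width between the minimum and maximum price.

```python
import numpy as np
import pandas as pd

# Hypothetical car prices.
df = pd.DataFrame({"price": [5000, 10000, 12000, 30000, 39000, 45000]})

# Four equally spaced edges define three equal-width bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True makes the first bin include the minimum price.
df["price_binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

# Count how many cars fall into each bin.
counts = df["price_binned"].value_counts()
print(df)
print(counts)
```

The bin edges here are 5000, about 18333, about 31667, and 45000, so three cars land in Low, one in Medium, and two in High.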

# 5. Turning Categorical Values to Numerical Values

Most statistical models are designed to process numerical inputs rather than objects or strings. Therefore, categorical data must be converted to numerical values before it is used to train these models.

![image-2.png](attachment:image-2.png)
Figure 10

Here is an example problem. Suppose we have the following data for housing types: 1 bedroom apartment and 2 bedroom apartment. After applying one-hot encoding, we get a new feature for each unique housing type:

### One-hot encoding

![image.png](attachment:image.png)
Figure 11

One solution is one-hot encoding. A new column (feature) is created for each unique housing type by adding a dummy variable for each unique value. If a house belongs to a particular housing type, its corresponding feature is set to 1, while the other features are set to 0.

### Dummy Variables in Pandas
Use the pandas <code>get_dummies()</code> function to convert categorical (object) variables to dummy variables.

In [None]:
pd.get_dummies(df['Housing_Type'])
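
The call can be tried on a small sample. This sketch assumes a made-up `Housing_Type` column with the two apartment types from the example; `dtype=int` is passed so the dummy columns hold 0 and 1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical housing data with two unique categories.
df = pd.DataFrame({
    "Housing_Type": [
        "1 bedroom apartment",
        "2 bedroom apartment",
        "1 bedroom apartment",
    ]
})

# One dummy column per unique housing type, encoded as 0/1 integers.
dummies = pd.get_dummies(df["Housing_Type"], dtype=int)

# Attach the dummy columns to the original data frame.
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each row has exactly one 1 across the dummy columns, marking the housing type it belongs to.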