# Data cleansing

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation:

*   loading
*   cleaning
*   transforming
*   rearranging

Such tasks are often reported to take up 80% or more of an analyst’s time.


Common issues and tasks when working with loaded data include:
* missing data
* duplicate data
* string manipulation
* other analytical data transformations


## Missing data
---

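The way pandas represents missing data is somewhat imperfect, but it is functional for a lot of users.

For numeric data,
  pandas uses the floating-point value `NaN` (Not a Number) to represent missing data.

This is called a *sentinel value*: a special value that marks a missing entry rather than a real measurement.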



**To detect the missing data**

---


Use the `isnull` method to check for missing values. First import the libraries:

```python
import pandas as pd
import numpy as np
```



```python
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
```

Check each element for a null value:

```python
string_data.isnull()
```
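As a small sketch of how the resulting boolean mask can be used (the names `mask`, `n_missing`, and `present` are ours, not part of pandas): pandas also treats Python's `None` as missing, the mask can be summed to count missing entries, and `notnull` gives the opposite mask.

```python
import numpy as np
import pandas as pd

# Both np.nan and None count as missing values.
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

mask = s.isnull()              # True where a value is missing
n_missing = int(mask.sum())    # True counts as 1, so this counts missing entries
present = s[s.notnull()]       # keep only the non-missing entries

print(mask.tolist())           # [False, True, True, False]
print(n_missing)               # 2
print(present.tolist())        # ['aardvark', 'avocado']
```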

The NA handling methods available on a DataFrame are listed below:

![image-20230906053837943](./assets/image-20230906053837943.png)


### Filtering the data
---


*NA* data can be filtered out as shown:

```python
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data
```

The `dropna` method filters out the NA values:

```python
data.dropna()
```

A DataFrame can also be filtered:

```python
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data
```

By default, `dropna` drops every row that contains at least one NA value:


```python
cleaned = data.dropna()
cleaned
```

**We** can drop only the rows in which **all** values are NA by using `how='all'`:

```python
data.dropna(how='all')
```

**To** drop only the columns in which all values are NA, add the `axis=1` parameter:

```python
data.dropna(how='all',axis=1)
```

Try adding a new column that contains only NA values:
```python
data[4] = NA
data
```

Drop NA values along the column axis:
```python
data.dropna(axis=1, how='all')
```

Compare with:
```python
data.dropna(axis=1)
```

What is the difference between `how='all'` and omitting the `how` parameter?
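A minimal sketch of the difference, using a made-up frame `demo` (not from the exercise data) with one complete column, one partly missing column, and one entirely missing column. Without `how`, pandas uses `how='any'`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'full': [1.0, 2.0, 3.0],
                     'partial': [1.0, np.nan, 3.0],
                     'empty': [np.nan, np.nan, np.nan]})

# Default how='any': drop a column if it has at least one NA.
any_dropped = demo.dropna(axis=1)

# how='all': drop a column only if every value in it is NA.
all_dropped = demo.dropna(axis=1, how='all')

print(list(any_dropped.columns))   # ['full']
print(list(all_dropped.columns))   # ['full', 'partial']
```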

We may want to keep rows that have at least a certain number of observed values, because we can tolerate some missing data. The `thresh` parameter sets that minimum:
```python
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
```


Drop every row that contains an NA value:
```python
df.dropna()
```

Keep only the rows whose number of non-NA values is greater than or equal to the threshold (here 2) and drop the rest:
```python
df.dropna(thresh=2)
```
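Because the frame above is random, a fixed example may be easier to check by hand. This is only an illustrative sketch; the frame `demo` is made up, with 1, 2, and 3 observed values in its three rows.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame([[1.0, np.nan, np.nan],   # 1 non-NA value
                     [1.0, 2.0, np.nan],      # 2 non-NA values
                     [1.0, 2.0, 3.0]])        # 3 non-NA values

strict = demo.dropna()            # keeps rows with no NA at all
lenient = demo.dropna(thresh=2)   # keeps rows with at least 2 non-NA values

print(list(strict.index))    # [2]
print(list(lenient.index))   # [1, 2]
```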

#### Work



From the file given in `file/property_data.csv`, produce the outputs below.

Output 1
![image-20230906061553156](./assets/image-20230906061553156.png)

Output 2

![image-20230906061619935](./assets/image-20230906061619935.png)

Output 3

![image-20230906061652010](./assets/image-20230906061652010.png)

### Filling in missing data
---
Instead of filtering out missing data, we can fill it with a default value.
We can fill every `NA` value with the same default:

```python
df
```

```python
df.fillna(0)
```


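Or we can pass a dictionary to use a different fill value for each column. The keys are column labels, so this call only has an effect on a DataFrame that has `PID` and `ST_NUM` columns: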
```python
df.fillna({"PID":0.5,"ST_NUM":1.2})
```
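A small sketch of per-column filling. The frame `demo` and its values are made up for illustration (only the column names `PID` and `ST_NUM` come from the call above; `OTHER` is hypothetical). Columns that are not listed in the dictionary are left untouched.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'PID': [100.0, np.nan, 102.0],
                     'ST_NUM': [np.nan, 197.0, np.nan],
                     'OTHER': [np.nan, 1.0, 2.0]})

# Each key names a column; each value is the fill value for that column.
filled = demo.fillna({'PID': 0.5, 'ST_NUM': 1.2})

print(filled['PID'].tolist())           # [100.0, 0.5, 102.0]
print(filled['ST_NUM'].tolist())        # [1.2, 197.0, 1.2]
print(int(filled['OTHER'].isna().sum()))  # 1, because 'OTHER' was not in the dict
```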

`fillna` returns a new object, but we can modify the existing object in place:
```python
df.fillna(0,inplace=True)
df
```


We can also fill missing values by interpolation, for example by carrying the last valid value forward.

First create a new DataFrame with some `NA` values:
```python
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
```

Forward fill (`ffill`) propagates the last valid value downwards. For more details on the available methods, search for `dataframe fillna`:
```python
df.ffill()
```


We can set a limit so that no more than `limit` consecutive NA values are filled:
```python
df.ffill(limit=2)
```


Other fill methods can be found in the API,
such as `bfill`, which is backward fill:
```python
df.bfill()
```
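The three fill variants are easier to compare on a fixed Series than on random data. The following is a sketch with a made-up Series `s` that has a gap of three missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()          # carry the last valid value forward
limited = s.ffill(limit=2)   # fill at most 2 consecutive NA values
backward = s.bfill()         # pull the next valid value backward

print(forward.tolist())          # [1.0, 1.0, 1.0, 1.0, 5.0]
print(limited.isna().tolist())   # [False, False, False, True, False]
print(backward.tolist())         # [1.0, 5.0, 5.0, 5.0, 5.0]
```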



Try

```python
df[6] = [1,2,3]
df
```

#### Work

---
Using the property data from the previous work,
produce these DataFrames:

Output 1
![image-20230906063035319](./assets/image-20230906063035319.png)


Output 2
![image-20230906063122466](./assets/image-20230906063122466.png)

Output 3
![image-20230906063412850](./assets/image-20230906063412850.png)


## Data Transformation



We need to transform the data into a format that we can manipulate later.

### Removing Duplicates


Duplicate rows should be removed because they can distort the results of an analysis.

Set up a new DataFrame:
```python
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
```

The `duplicated` method returns a boolean Series that marks each row that repeats an earlier row:

```python
data.duplicated()
```

We can then drop the duplicate rows:
```python
data.drop_duplicates()
```

We can choose which columns are used as keys when checking for duplicates. First add a new column so we can see which rows are kept:
```python
data['v1'] = range(7)
data
```


Now check for duplication:
```python
data.duplicated()
```

Check for duplicates using only the key in column `k1`:
```python
data.duplicated(['k1'])
```

The previous calls keep the first occurrence of each duplicate. We can keep the last occurrence instead by using the `keep` parameter.
Compare this code:
```python
data.drop_duplicates(['k1','k2'])
```

with
```python
data.drop_duplicates(['k1','k2'],keep='last')
```
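To see exactly which rows survive, the sketch below rebuilds the same frame and compares the index labels of the two results (the names `first` and `last` are ours). Only rows 5 and 6 share the same (`k1`, `k2`) pair.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4],
                     'v1': range(7)})

# Rows 5 and 6 are both ('two', 4); every other pair is unique.
first = data.drop_duplicates(['k1', 'k2'])               # keeps row 5
last = data.drop_duplicates(['k1', 'k2'], keep='last')   # keeps row 6

print(list(first.index))   # [0, 1, 2, 3, 4, 5]
print(list(last.index))    # [0, 1, 2, 3, 4, 6]
```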

### Transforming Data Using a Function or Mapping

Data can be mapped to a more useful representation.

Start with this DataFrame:
```python
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
```

Suppose we have a dictionary that maps each kind of meat to an animal:
```python
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}
```

Some of the food names are capitalized, so first convert every value to lowercase:
```python
lowercased = data['food'].str.lower()
lowercased
```

Then map the data:
```python
data['animal'] = lowercased.map(meat_to_animal)
data
```
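One thing to watch for: a value that has no key in the dictionary is mapped to `NaN`. The sketch below uses a shortened, made-up version of the data (including `'tofu'`, which is deliberately absent from the mapping) and shows one possible way to label such values afterwards.

```python
import pandas as pd

food = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Lowercase first so that 'Pastrami' matches the key 'pastrami'.
animal = food.str.lower().map(meat_to_animal)

# 'tofu' has no entry in the dictionary, so it becomes NaN.
labeled = animal.fillna('unknown')

print(labeled.tolist())   # ['pig', 'cow', 'unknown']
```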

### Replacing value
---

Some values can be replaced to make the data easier to work with.
For example, with the given Series:
```python
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
```


The value -999 may be a **sentinel** value.

A sentinel value marks a state of the data rather than a real measurement, for example an end marker or a missing (`NA`) entry. We can replace it with `NaN`:
```python
data.replace(-999, np.nan)
```



We can replace multiple values with a single value:

```python
data.replace([-999,-1000],np.nan)
```

Or replace each value with a different substitute:
```python
data.replace([-999,-1000],[np.nan,0])
```

Or use a dictionary that maps each old value to its replacement:
```python
data.replace({-999: np.nan, -1000: 0})
```
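Replacing sentinels matters because they otherwise take part in calculations as if they were real measurements. A minimal sketch with a made-up Series `raw` compares the mean before and after the replacement (pandas skips `NaN` when computing the mean):

```python
import numpy as np
import pandas as pd

raw = pd.Series([1.0, -999.0, 2.0, -999.0, 3.0])

# With the sentinels left in, they dominate the result.
raw_mean = raw.mean()          # (1 - 999 + 2 - 999 + 3) / 5 = -398.4

# After replacement, NaN values are skipped by mean().
cleaned = raw.replace(-999, np.nan)
clean_mean = cleaned.mean()    # (1 + 2 + 3) / 3 = 2.0

print(raw_mean, clean_mean)
```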

### Renaming Axis Index

The axis labels loaded from different sources may be hard to understand,
so we rename them for better readability:
```python
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
```


We can use a function to transform each index label:
```python
transform = lambda x: x[:4].upper()
new_data_index = data.index.map(transform)
```

To change the index labels, assign the new values to `index`:
```python
data.index = new_data_index
data
```

Or, if we want a renamed copy without modifying the original, we can use the `rename` method:
```python
data.rename(index=str.title, columns=str.upper)
```



To change specific labels, pass a dictionary that maps each old name to its new name:
```python
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
```

To modify the DataFrame directly, use the `inplace` parameter:
```python
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
```
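The difference between the copying and in-place forms can be checked directly. This sketch uses a small made-up frame (the labels `'ohio'` and `'utah'` are illustrative only):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['ohio', 'utah'],
                     columns=['one', 'two', 'three'])

# rename returns a new object; frame itself is not changed.
renamed = frame.rename(index=str.title, columns=str.upper)
unchanged_index = list(frame.index)

# With inplace=True the original object is modified instead.
frame.rename(index={'ohio': 'indiana'}, inplace=True)

print(list(renamed.index), list(renamed.columns))
print(unchanged_index)      # ['ohio', 'utah']
print(list(frame.index))    # ['indiana', 'utah']
```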


#### Work
---

From the previous work, produce the following outputs.

Output 1

Fill in the missing PID values and use PID as the index.
![image-20230906065515528](./assets/image-20230906065515528.png)

Output 2
Change the column names to your own language.
![image-20230906065658172](./assets/image-20230906065658172.png)

### Discretization and Binning
---

Continuous data is often separated into bins for analysis.

A bin is a range of values that we want to analyze together.
Create a list of ages:
```python
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
```


Then we define the bin edges:
```python
bins = [18, 25, 35, 60, 100]
```

Then we cut the data into the bins. The result shows which bin each value falls into:
```python
cuts = pd.cut(ages, bins)
cuts
```

Instead of the interval labels, we can look at the integer code (index) of the bin for each value:
```python
cuts.codes
```

The categories (the interval for each bin) can be shown too:
```python
cuts.categories
```

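We can count the number of values in each bin with the `value_counts` method (the top-level `pd.value_counts` function is deprecated in recent pandas versions, so we wrap the result in a Series):
```python
pd.Series(cuts).value_counts()
```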

We can name the bins by passing a list of labels:
```python
group_names = ['Youth', 'YoungAdult','MiddleAged','Senior']
pd.cut(ages,bins,labels=group_names)
```
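The interval notation `(18, 25]` means the right edge is included and the left edge is not. As a sketch that goes slightly beyond the text above, `pd.cut` also accepts `right=False` to close the intervals on the left instead. The shortened `ages` list here is made up so that two values sit exactly on a bin edge.

```python
import pandas as pd

ages = [20, 25, 35, 61]
bins = [18, 25, 35, 60, 100]

# Default right=True: intervals are (18, 25], (25, 35], (35, 60], (60, 100].
right_closed = pd.cut(ages, bins)

# right=False: intervals are [18, 25), [25, 35), [35, 60), [60, 100).
left_closed = pd.cut(ages, bins, right=False)

print(right_closed.codes.tolist())   # [0, 0, 1, 3]
print(left_closed.codes.tolist())    # [0, 1, 2, 3]
```

The values 25 and 35 move to the next bin when the closed side changes, so it is worth checking which side is intended before counting.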

If we pass an integer number of bins instead of explicit bin edges, `cut` computes equal-length bins based on the minimum and maximum values in the data.
```python
data = np.random.rand(20)
pd.cut(data,4,precision=2)
```

Try counting the number of values in each bin:
```python
pd.Series(pd.cut(data,4,precision=2)).value_counts()
```

### Detecting and Filtering Outliers
---


Outliers may be errors introduced while gathering the data, so we want to find them and filter them out.
Let's start with the following data:
```python
data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
```

To find the values in column 2 whose absolute value exceeds 3:
```python
col = data[2]
col[np.abs(col) >3]
```

To select every row that contains at least one value whose absolute value exceeds 3, we can use the `any` method:
```python
data[(np.abs(data) > 3).any(axis=1)]
```
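Since the frame above is random, here is a fixed sketch of the same idea on a made-up frame. The same boolean mask selects the outlier rows, and its negation (`~mask`) filters them out.

```python
import pandas as pd

data = pd.DataFrame({'a': [0.5, -3.5, 1.0],
                     'b': [4.0, 0.1, -0.2]})

# True for each row that has at least one value with |value| > 3.
mask = (data.abs() > 3).any(axis=1)

outlier_rows = data[mask]    # rows that contain an outlier
filtered = data[~mask]       # rows without any outlier

print(list(outlier_rows.index))   # [0, 1]
print(list(filtered.index))       # [2]
```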

#### Work
---

Use the given property file.
Categorize the size of each house as a small house (less than 800 sq. ft.), a medium house (between 801 and 1200 sq. ft.), or a large house (more than 1200 sq. ft.).

Show the number of houses of each size.

Ignore any values that are not numbers.

# Data Wrangling: Join, Combine, Reshape

## Hierarchical Indexing
---


Hierarchical indexing allows an axis to have multiple index levels.

It is useful when working with higher-dimensional data in a lower-dimensional form.


Use this setup configuration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.options.display.max_rows = 20
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
```

Create a Series with a multi-level index:
```python
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
```

We can inspect the index of the data:
```python
data.index
```

We can select part of the data by an outer-level label:
```python
data['b']
```

We can also slice a range of outer-level labels:
```python
data['b':'c']
```

The `loc` indexer can be used to select a particular group of labels from the hierarchical index:
```python
data.loc[['b','d']]
```

We can also select from the inner level by passing a slice (`:`) for the outer level:
```python
data.loc[:,2]
```
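The selections above are easier to verify on fixed values than on random numbers. This sketch uses a made-up Series whose values are simply 0 to 5:

```python
import pandas as pd

data = pd.Series(range(6),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 2, 1, 2]])

outer = data['b']                # one outer label; result is indexed by the inner level
outer_range = data['b':'c']      # slice of outer labels, both ends included
several = data.loc[['a', 'c']]   # a list of outer labels
inner = data.loc[:, 2]           # every outer label, inner label 2

print(outer.tolist())         # [2, 3]
print(outer_range.tolist())   # [2, 3, 4, 5]
print(several.tolist())       # [0, 1, 4, 5]
print(inner.tolist())         # [1, 3, 5]
```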

Hierarchical indexing plays an important role in reshaping data and in group-based operations.

A hierarchically indexed Series can be rearranged into a DataFrame with the `unstack` method:
```python
data.unstack()
```

The inverse operation is `stack`:
```python
data.unstack().stack()
```
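When some (outer, inner) combinations are absent, `unstack` has to fill the resulting holes with `NaN`. A minimal sketch with a made-up Series in which the pair `('b', 2)` does not exist:

```python
import pandas as pd

data = pd.Series([10, 20, 30],
                 index=[['a', 'a', 'b'],
                        [1, 2, 1]])

# Outer level becomes the rows, inner level becomes the columns.
wide = data.unstack()

print(wide.shape)                   # (2, 2)
print(wide.loc['a', 2])             # 20.0
print(pd.isna(wide.loc['b', 2]))    # True, there was no ('b', 2) entry
```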

In a DataFrame, either axis can have a hierarchical index:
```python
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame
```

The hierarchy levels can have names:
```python
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
```

### Work

Create this DataFrame:

![image-20230907050812483](./assets/image-20230907050812483.png)



## Reordering and Sorting Levels
---

To transform data, we may have to swap the levels of an index.

The `swaplevel` method swaps two levels:
```python
frame.swaplevel('key1','key2')
```

After swapping, the index is not sorted, so the rows are not grouped by the outer level. We can sort the index by any level to get a clearer view:
```python
frame.sort_index(level=1)
```

Swapping the levels and then sorting with `sort_index` on the new outer level gives a cleaner hierarchy:
```python
frame.swaplevel(0,1).sort_index(level=0)
```
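The effect on the row order can be checked on a small fixed frame. This is a sketch with made-up column names `x`, `y`, `z`; the index levels use the same names as above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=['x', 'y', 'z'])
frame.index.names = ['key1', 'key2']

# swaplevel only swaps the labels; the row order stays the same.
swapped = frame.swaplevel('key1', 'key2')

# Sorting on the new outer level groups the rows by key2.
ordered = swapped.sort_index(level=0)

print(list(swapped.index))   # [(1, 'a'), (2, 'a'), (1, 'b'), (2, 'b')]
print(list(ordered.index))   # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```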

## Summary Statistics by Level
---

Summary statistics can be computed at a specific level of an index.

First, compute the plain sum of each column over all rows:
```python
frame.sum()
```

Then try grouping the rows by the `key2` level:
```python
frame.groupby(['key2']).sum()
```

Or calculate across the columns:
```python
frame.sum(axis=1)
```

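Then try grouping the columns by the `color` level. Recent pandas versions deprecate `groupby(..., axis=1)`, so we transpose, group by the level, and transpose back:
```python
frame.T.groupby(level='color').sum().T
```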
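Because this frame is built from `np.arange(12)`, the grouped sums can be checked by hand. The sketch below rebuilds the frame and groups by level explicitly with `level=`; the names `by_key2` and `by_color` are ours.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same key2 label.
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share the same color: transpose, group, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2.loc[1].tolist())      # [6, 8, 10]  (rows [0, 1, 2] + [6, 7, 8])
print(by_color['Green'].tolist())   # [2, 8, 14, 20]
print(by_color['Red'].tolist())     # [1, 4, 7, 10]
```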

## Indexing with a DataFrame's Columns


Sometimes we want to use one or more columns of a DataFrame as the row index:
```python
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame
```


We can create a new DataFrame that uses some of the columns as its index:

```python
frame2 = frame.set_index(['c','d'])
frame2
```


By default, the columns used for the index are removed from the resulting DataFrame. To keep them as regular columns as well, pass `drop=False`:
```python
frame.set_index(['c','d'],drop = False)
```
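A short sketch on a smaller made-up frame shows which columns remain in each case. It also uses `reset_index`, which is not covered above but is the pandas method that moves index levels back into columns.

```python
import pandas as pd

frame = pd.DataFrame({'a': [0, 1, 2],
                      'c': ['one', 'one', 'two'],
                      'd': [0, 1, 0]})

indexed = frame.set_index(['c', 'd'])            # 'c' and 'd' move into the index
kept = frame.set_index(['c', 'd'], drop=False)   # 'c' and 'd' stay as columns too
restored = indexed.reset_index()                 # index levels become columns again

print(list(indexed.columns))    # ['a']
print(list(kept.columns))       # ['a', 'c', 'd']
print(list(restored.columns))   # ['c', 'd', 'a']
```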

### Work



From the given [./file/car_details.xlsx](./file/car_details.xlsx),

produce a hierarchically indexed DataFrame in which carbrand, carmodel, and the model_name TH are the index keys.


Then find the average standard price for each car brand, ignoring `NA` values.