<center>
<img src="https://www.iybssd2022.org/wp-content/uploads/ASAQ.jpg" width="150"/> 
</center>

        
<center>
<h1><font color= "blue" size="+2">ASAQ Python Data Analysis Courses</font></h1>
</center>

---

<center><h1><font color="blue" size="+2">Data Cleaning and Conversion</font></h1></center>

## <font color="red">Objectives</font>

We want to:

- Read a CSV file.
- Inspect the rows and columns.
- Identify missing values and clean the data.
- Perform data conversion.
- Create basic plots.

## <font color="red">Required modules/packages</font>

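- `pandas`: reading, cleaning and converting tabular data.
- `matplotlib`: static plots.
- `skimpy`: summary statistics of a `DataFrame`.
- `plotly`: interactive plots.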

In [None]:
#!pip install skimpy
#!pip install plotly

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import matplotlib.pyplot as plt

In [None]:
import pandas as pd

import skimpy

import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
pio.templates.default = "plotly_white"

print(f"Pandas version: {pd.__version__}")

## <font color="red">Data Access</font>

File name:

In [None]:
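file_name = "AirQuality.csv"
# Remote copy of the file on GitHub
data_url = f"https://github.com/JulesKouatchou/asaq_py/raw/main/sample_data/{file_name}"
# Local copy: this assignment overrides the remote URL above.
# Comment it out to read the file directly from GitHub.
data_url = "/".join(["../sample_data", file_name])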

## <font color="red">Read the file</font>

- We use `pandas` to read the CSV file.
- We obtain a `DataFrame`, which organizes data in labeled rows and columns.
  - Each row is considered a data point.
  - Each column can be seen, for instance, as the set of latitudes or the measurements of a specific field.
     - All the values of a given column have the same data type (integer, float, boolean, ...).
     - Each column is a `Series`, typically backed by a one-dimensional `NumPy` array.
- A `DataFrame` can be seen as a collection of one-dimensional arrays that share the same row index.
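
The points above can be illustrated with a minimal sketch. The `toy` DataFrame and its column names are hypothetical and are not taken from `AirQuality.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from AirQuality.csv
toy = pd.DataFrame({
    "lat": [10.5, 20.25, 30.0],
    "station_id": [1, 2, 3],
    "valid": [True, False, True],
})

# Each column is a pandas Series holding a single data type
print(toy.dtypes)

# The values of a column can be viewed as a one-dimensional NumPy array
lat_values = toy["lat"].to_numpy()
print(type(lat_values), lat_values.ndim)
```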

In [None]:
df = pd.read_csv(data_url, sep=";")

In [None]:
type(df)

In [None]:
df

#### Quick observations
- There are 17 labeled columns
   - The first two columns appear to be related to the date and time
   - The remaining columns have measurement related data
- There are 9471 rows (data points)
   - Each row has an index, 0 to 9470
   - Each data point consists of 17 values.
- There are many missing values.
   - What are we going to do with missing values?

## <font color="red"> Dealing with missing values</font>

When we identify the missing values, we typically have at least three options:

- Dropping the missing values.
- Filling the missing values.
- Interpolating the data to replace the missing values.

### <font color="blue">Identify the columns with missing values</font>

In [None]:
df.isnull().sum()

__Observations__

- All the columns have missing values.
- The last two columns only have missing values.

We can also compute the number of non-missing values per column.

In [None]:
df.notnull().sum()

### <font color="blue">Dropping missing values</font>

`dropna()`: Removes rows or columns containing missing values.
- `df.dropna(axis=0)`: Drops rows with missing values.
- `df.dropna(axis=1)`: Drops columns with missing values.
- `df.dropna(how='all')`: Drops rows where all values are missing.
- `df.dropna(thresh=2)`: Drops rows with fewer than 2 non-null values.
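
As a rough illustration of how these options differ, here is a small sketch on a hypothetical `toy` DataFrame (the data are made up for the example). None of the calls modifies `toy`, because `inplace` is not set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "c" and the row with index 2 are entirely missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [5.0, 6.0, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

rows_any = toy.dropna(axis=0)             # drop rows with at least one missing value
cols_any = toy.dropna(axis=1)             # drop columns with at least one missing value
rows_all = toy.dropna(how="all")          # drop rows where every value is missing
cols_all = toy.dropna(axis=1, how="all")  # drop columns where every value is missing
rows_thresh = toy.dropna(thresh=2)        # keep rows with at least 2 non-null values

for name, result in [("rows_any", rows_any), ("cols_any", cols_any),
                     ("rows_all", rows_all), ("cols_all", cols_all),
                     ("rows_thresh", rows_thresh)]:
    print(name, result.shape)
```

Note that `axis=1` without `how="all"` removes every column that has any missing value, which is much more aggressive than removing only the empty columns.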

__In our example, we will drop columns and rows that only have missing values.__

Remove rows with only missing values:

In [None]:
df.dropna(how='all', inplace=True)

In [None]:
df

Remove columns with only missing values:

In [None]:
# how='all' drops only the columns in which every value is missing
df.dropna(axis=1, how='all', inplace=True)

df

In [None]:
df.info()

__Observations__

- There are now 15 columns and 9357 rows.
- There are most likely no remaining missing values.
- The data type of the values of some of the columns is `object`:
   - We need to pay attention and do data conversion if necessary.

### <font color="blue">Other options for dealing with missing values</font>

__Filling missing values__:

`fillna()`: Fills missing values with a specified value.
- `df.fillna(0)`: Fills missing values with 0.
- `df.ffill()`: Fills missing values with the last non-null value (forward fill). Older code uses `df.fillna(method='ffill')`, which is deprecated in recent versions of `pandas`.
- `df.bfill()`: Fills missing values with the next non-null value (backward fill). Older code uses `df.fillna(method='bfill')`, which is deprecated as well.
- `df.fillna(df.mean(numeric_only=True))`: Fills missing values of the numeric columns with the mean of each column.
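
A minimal sketch of the filling strategies on hypothetical data follows. It uses `ffill()` and `bfill()` directly, on the assumption that a recent `pandas` version is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, 2.0, np.nan, 6.0],
})

filled_zero = toy.fillna(0)    # replace every gap with 0
filled_ffill = toy.ffill()     # propagate the last valid value forward
filled_bfill = toy.bfill()     # propagate the next valid value backward
filled_mean = toy.fillna(toy.mean(numeric_only=True))  # use each column's mean

print(filled_ffill)
print(filled_mean)
```

A forward fill leaves a leading gap untouched (there is no earlier value to copy), and a backward fill does the same for a trailing gap.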

__Data interpolation__:

`interpolate()`: Estimates missing values using interpolation methods.
- `df.interpolate(method='linear')`: Linear interpolation.
- `df.interpolate(method='time')`: Time-based interpolation (requires a `DatetimeIndex`).
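
The difference between the two methods shows up when the time stamps are unevenly spaced. The sketch below uses a hypothetical series with made-up time stamps and values.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements at 00:00, 01:00 and 04:00; the middle one is missing
times = pd.to_datetime(["2004-03-10 00:00", "2004-03-10 01:00", "2004-03-10 04:00"])
s = pd.Series([0.0, np.nan, 8.0], index=times)

# 'linear' ignores the index and treats the values as equally spaced
linear = s.interpolate(method="linear")

# 'time' weights the estimate by the elapsed time between index values
timed = s.interpolate(method="time")

print(linear.iloc[1], timed.iloc[1])
```

Here the linear estimate is the midpoint of the neighbors, while the time-based estimate accounts for the missing point being one hour into a four-hour interval.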

Neither of these options is needed here.

## <font color="red">Data Conversion</font>

### Combine the first two columns into a `datetime` object

Write the values of the `Time` column as `HH:MM:SS` instead of `HH.MM.SS`.

In [None]:
# regex=False treats '.' as a literal dot and not as the "any character" pattern
df['Time'] = df['Time'].str.replace('.', ':', regex=False)

In [None]:
df

Combine the two columns:

In [None]:
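# The dates in this file are day-first (DD/MM/YYYY), so we state the format explicitly
df['t'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')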

df
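
The two conversion steps can be tried on a small, self-contained example. The sketch below assumes day-first dates (`DD/MM/YYYY`) and dot-separated times, and uses a hypothetical two-row `toy` DataFrame.

```python
import pandas as pd

# Hypothetical rows that mimic the Date and Time columns
toy = pd.DataFrame({
    "Date": ["10/03/2004", "13/03/2004"],
    "Time": ["18.00.00", "09.30.00"],
})

# Replace the literal dots with colons
toy["Time"] = toy["Time"].str.replace(".", ":", regex=False)

# An explicit format avoids any month/day ambiguity for dates such as 10/03/2004
toy["t"] = pd.to_datetime(toy["Date"] + " " + toy["Time"],
                          format="%d/%m/%Y %H:%M:%S")

print(toy["t"])
```

Without an explicit format, a date such as `10/03/2004` may be read as October 3 instead of March 10.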

Drop the original `Date` and `Time` columns:

In [None]:
df.drop(['Date', 'Time'], axis=1, inplace=True)

In [None]:
df

In [None]:
df.info()

### Replace commas (`,`) with dots (`.`)

- In the columns `CO(GT)`, `C6H6(GT)`, `T`, `RH` and `AH`, the numbers with decimals are written with commas.
- We need to convert from the European convention (decimal comma) to the American one (decimal point).

In [None]:
df[["CO(GT)", "C6H6(GT)", "T", "RH", "AH"]] = df[["CO(GT)", "C6H6(GT)", "T", "RH", "AH"]].replace(",", ".", regex=True)

In [None]:
df.head(5)

In the columns `CO(GT)`, `C6H6(GT)`, `T`, `RH` and `AH`, convert values from strings to floats.

In [None]:
df[["CO(GT)", "C6H6(GT)", "T", "RH", "AH"]] = df[["CO(GT)", "C6H6(GT)", "T", "RH", "AH"]].astype(float)

In [None]:
df.info()
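
As a possible alternative to the two steps above, `pd.read_csv` accepts a `decimal` argument that parses decimal commas while the file is read. The sketch below uses a small hypothetical CSV text in place of the real file.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated text with decimal commas
csv_text = "Date;T;RH\n10/03/2004;13,6;48,9\n11/03/2004;11,3;56,8\n"

# decimal="," tells the parser to read 13,6 as the float 13.6
toy = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")

print(toy.dtypes)
print(toy)
```

With this option the numeric columns are read as floats from the start, so no string replacement or `astype(float)` call is required.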

### <font color="blue"> Obtain descriptive statistics of each numeric column</font>

In [None]:
df.describe().T

In [None]:
skimpy.skim(df)

In [None]:
df.set_index('t', inplace=True)

In [None]:
# Compute the correlation matrix once and reuse its labels for both axes
corr = df.corr()
mat = px.imshow(corr, x=corr.columns, y=corr.columns,
                title="Correlation matrix",
                width=600, height=600)
mat.show()

In [None]:
fig, ax = plt.subplots(figsize=(6,10))
df.corr()['C6H6(GT)'].sort_values().to_frame().drop('C6H6(GT)').plot.barh(ax=ax)

In [None]:
corr_C6H6 = df.corr()['C6H6(GT)']

In [None]:
fields = list(corr_C6H6[corr_C6H6>0.5].index)
fields

In [None]:
field_colors = px.colors.qualitative.Plotly

In [None]:
total_concentrations = df[fields].sum()

In [None]:
concentration_data = pd.DataFrame({
    "Field": fields,
    "Concentration": total_concentrations
})

In [None]:
fig = px.pie(concentration_data, names="Field", values="Concentration",
             title="Field Concentrations",
             hole=0.4, color_discrete_sequence=field_colors,
             width=500, height=500)

# Update layout for the donut plot
fig.update_traces(textinfo="percent+label")
fig.update_layout(legend_title="Field")

### <font color="blue">Basic plots</font>

In [None]:
# 't' is now the index (see set_index above), so it is used as the x-axis by default
df.plot(y="T")