## Data Cleaning

Let's load some data for TSLA. Unfortunately, this data is not quite as *clean* as our NVDA data, so we'll need to do some data wrangling. The file we're looking for is `TSLA_2015_2024.csv`.

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("data/TSLA_2015_2024.csv")
df

| | Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 0 | 11/07/2024 | 241.029999 | 271.000000 | 239.649994 | 263.299988 | 221707300.0 |
| 1 | 13/08/2020 | 108.066666 | 110.078667 | 104.484001 | 107.400002 | 306379500.0 |
| 2 | 13/08/2020 | 108.066666 | 110.078667 | 104.484001 | 107.400002 | 306379500.0 |
| 3 | 30/10/2019 | 21.000668 | 21.252666 | 20.664667 | 20.866667 | 144627000.0 |
| 4 | 27/08/2015 | 16.199333 | NaN | 15.387333 | 15.400000 | 114840000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2561 | 10/08/2018 | 23.699333 | 24.000000 | 23.066668 | 23.600000 | 173280000.0 |
| 2562 | 21/06/2021 | 206.943329 | 210.463333 | 202.960007 | 208.160004 | 74438100.0 |
| 2563 | 20/06/2016 | 14.646667 | 14.916667 | 14.548667 | 14.633333 | 53332500.0 |
| 2564 | 21/02/2019 | 19.415333 | 20.216000 | 19.366667 | 20.120667 | 133638000.0 |


Can you see what we mean by messy? How many issues can you spot?

- Dates out of order
- Duplicate rows
- Missing values

## Ordering and Duplicates

Let's start by sorting the index.

In [5]:
df.index.is_monotonic_increasing

df.sort_index(inplace=True)

df.index.is_monotonic_increasing

True

Now let's focus on duplicates:
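A minimal sketch of counting and removing exact duplicate rows with `duplicated()` and `drop_duplicates()`. The `toy` frame is made up for illustration and is not taken from the TSLA file:

```python
import pandas as pd

# Illustrative data only: the second and third rows are exact copies.
toy = pd.DataFrame(
    {
        "Date": ["27/08/2015", "13/08/2020", "13/08/2020"],
        "Close": [16.2, 108.1, 108.1],
    }
)

# duplicated() marks every repeat of an earlier row as True
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default
deduped = toy.drop_duplicates()

print(n_duplicates)  # 1
print(len(deduped))  # 2
```

Like most pandas methods, `drop_duplicates()` returns a new DataFrame, so the result has to be assigned to a variable to be kept.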

#### Tip: Method Chaining

**Method chaining** is a popular feature of pandas. It allows us to *chain* together several operations in a single statement. For example, we can drop any duplicates, set the index and sort the data frame all at once. Notice that we don't use `inplace` but instead re-assign the result to the original `df` variable.

```python
# Drop duplicates first, while Date is still a column and is part of the comparison
df = df.drop_duplicates().set_index("Date").sort_index()
```
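One caveat: the dates in this file are text in day/month/year form, so sorting them as strings does not give chronological order. The following is a hedged sketch, assuming the `%d/%m/%Y` format seen in the preview holds for every row, of parsing the dates inside a chain. The `toy` frame is made up for illustration:

```python
import pandas as pd

# Illustrative rows in the same day/month/year text format as the file
toy = pd.DataFrame(
    {
        "Date": ["11/07/2024", "13/08/2020", "13/08/2020", "27/08/2015"],
        "Close": [241.03, 108.07, 108.07, 16.20],
    }
)

clean = (
    toy.drop_duplicates()  # drop repeats while Date is still a column
    # convert the text dates to real timestamps (format is an assumption)
    .assign(Date=lambda d: pd.to_datetime(d["Date"], format="%d/%m/%Y"))
    .set_index("Date")
    .sort_index()  # chronological now that the index holds real dates
)

print(clean.index.is_monotonic_increasing)  # True
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer chains readable.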

## Not a Number (NaN)

### Exercise: Some Null Chain

Let's look at the missing or `NaN` values next. Previously, we saw that `info()` gave us some insight into how many missing values we had, but we can also use `isnull()`.

Can you chain `isnull()` with `sum()` to get a single value stating the total number of missing values in the data frame?

In [6]:
## YOUR CODE GOES HERE
df.isnull()



| | Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False |
| 4 | False | False | True | False | False | False |
| ... | ... | ... | ... | ... | ... | ... |
| 2561 | False | False | False | False | False | False |
| 2562 | False | False | False | False | False | False |
| 2563 | False | False | False | False | False | False |
| 2564 | False | False | False | False | False | False |
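A small sketch of how `sum()` behaves on the boolean frame that `isnull()` returns, using made-up values: one `sum()` counts the missing values in each column, and a second `sum()` collapses those counts into a single number.

```python
import numpy as np
import pandas as pd

# Illustrative data only, with three missing values in total
toy = pd.DataFrame(
    {
        "High": [24.0, np.nan, 14.9],
        "Low": [23.1, 15.4, np.nan],
        "Volume": [173280000.0, np.nan, 53332500.0],
    }
)

# True counts as 1, so summing the boolean frame counts NaNs per column
per_column = toy.isnull().sum()

# Summing the per-column counts gives one total for the whole frame
total = toy.isnull().sum().sum()

print(per_column)
print(total)  # 3
```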


We can find out which rows have missing data using `isnull()`, `any()` along rows and some smart *masking*.
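A minimal sketch of that masking approach on a made-up frame (the `toy` values are illustrative, not rows from the TSLA file):

```python
import numpy as np
import pandas as pd

# Illustrative data only: rows 0 and 2 each have one missing value
toy = pd.DataFrame(
    {
        "Close": [16.2, 21.0, 108.1],
        "High": [np.nan, 21.3, 110.1],
        "Low": [15.4, 20.7, np.nan],
    }
)

# any(axis=1) is True for each row that has at least one missing value
mask = toy.isnull().any(axis=1)

# Boolean indexing keeps only the rows where the mask is True
rows_with_missing = toy[mask]

print(rows_with_missing)
```

The mask is an ordinary boolean Series, so it can be inverted with `~mask` to select the complete rows instead.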

Now that we've identified our missing values, the big question is how to handle them. There are many approaches to this that will vary depending on the data and the further analysis you plan to carry out.

### Exercise: Cleaning up

Notice that we didn't actually update the `df` variable above, so our DataFrame still contains missing values. Fix all of them by applying the following rules:
- Fill missing Close by linear interpolation
- Fill missing Volume with the value from the day before
- Fill missing Open with the median Open
- Fill missing High with the Close or Open, whichever is higher
- Fill missing Low with a value 3% lower than the High


Your DataFrame `df` should have no missing values when done. Use `info()` to confirm.

**NOTE:** When changing values in a data frame, it is recommended to avoid using `inplace`, and instead re-assign the variable.

In [7]:
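## YOUR CODE GOES HERE
# Close: linear interpolation between neighbouring rows
df["Close"] = df["Close"].interpolate(method="linear")

# Volume: carry the previous row's value forward
df["Volume"] = df["Volume"].ffill()

# Open: median of the column
df["Open"] = df["Open"].fillna(df["Open"].median())

# High: the larger of Close and Open on the same row
df["High"] = df["High"].fillna(df[["Close", "Open"]].max(axis=1))

# Low: 3% below High
df["Low"] = df["Low"].fillna(df["High"] * 0.97)

# Total number of missing values left in the frame
df.isnull().sum().sum()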



np.int64(0)

#### Advanced: Data Types

You may have noticed that the **Volume** column is a `float64` rather than an `int64` *dtype*, even though trading volumes are whole numbers. Missing values (NaN) are represented as a special floating point value, so all the values in **Volume** were automatically *upcast* to floats.

Ideally, each column should have the *dtype* that most accurately represents its values. This can also help performance when working with large data frames. Now that we've resolved our missing values, we can *cast* our trading volumes to integers.

In [8]:
df["Volume"] = df["Volume"].astype("int64")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2566 entries, 0 to 2565
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    2566 non-null   object 
 1   Close   2566 non-null   float64
 2   High    2566 non-null   float64
 3   Low     2566 non-null   float64
 4   Open    2566 non-null   float64
 5   Volume  2566 non-null   int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 120.4+ KB
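The upcasting described above is easy to reproduce on a toy Series (values made up for illustration). The sketch also shows pandas' nullable `Int64` dtype, which is not used in this notebook but can hold missing values without switching to floats:

```python
import numpy as np
import pandas as pd

# A single NaN is enough to make pandas store the whole Series as floats
volume = pd.Series([114840000, np.nan, 53332500])
print(volume.dtype)  # float64

# Once the gap is filled, the values can be cast back to plain integers
filled = volume.ffill().astype("int64")
print(filled.dtype)  # int64

# Alternative: the nullable Int64 dtype (capital I) keeps the gap as <NA>
nullable = volume.astype("Int64")
print(nullable.dtype)  # Int64
```

Casting to plain `int64` while NaNs are still present raises an error, which is why the cast comes after the missing values are filled.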


## Saving Data

Now that we've cleaned our data, let's save it by writing it to a new CSV file using pandas' `to_csv()`.

In [9]:
df.to_csv("TSLA_10_clean.csv")
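By default, `to_csv()` also writes the row index as an extra, unnamed column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of a round trip with `index=False`, writing to a temporary directory (the file name and `toy` data here are illustrative):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative data only
toy = pd.DataFrame(
    {"Date": ["27/08/2015", "20/06/2016"], "Volume": [114840000, 53332500]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_clean.csv"  # hypothetical file name

    # index=False leaves the default 0, 1, 2, ... index out of the file
    toy.to_csv(path, index=False)

    # Read the file back to check what was actually written
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())  # ['Date', 'Volume']
```

If the index carries real information, such as parsed dates, it is usually worth keeping it in the file and reading it back with `index_col`.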