
# Dealing with NaN

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
warnings.filterwarnings('ignore')
from IPython.core.pylabtools import figsize


%matplotlib inline




## Open the CSV file

In [None]:
df = pd.read_csv('model.csv',sep=';')

Use head(), tail(), and info() to get a first look at your data.

Then combine isna() with any() or sum() to see which columns of the dataframe contain NaN values and how many there are.

In [None]:
df.head()

Unnamed: 0,Data,Sales
0,01/mai,56
1,01/mai,?
2,01/mai,?
3,02/mai,35
4,03/mai,58


In [None]:
df.tail()

Unnamed: 0,Data,Sales
35,27/mai,202
36,28/mai,65
37,29/mai,67
38,30/mai,122
39,31/mai,463


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    40 non-null     object
 1   Sales   37 non-null     object
dtypes: object(2)
memory usage: 768.0+ bytes


In [None]:
df.isna().any()

Data     False
Sales     True
dtype: bool

In [None]:
df.isna().sum()

Data     0
Sales    3
dtype: int64

Here isna() reports only 3 NaN values, but the dataset also contains placeholder values such as '?' and '-' that should be treated as missing.

So, how do we deal with these values?
First, look at the data below:

In [None]:
print(df.shape)
df

(40, 2)


Unnamed: 0,Data,Sales
0,01/mai,56
1,01/mai,?
2,01/mai,?
3,02/mai,35
4,03/mai,58
5,04/mai,64
6,04/mai,98
7,04/mai,
8,04/mai,
9,04/mai,-
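
Before cleaning, it can help to list exactly which placeholder strings are hiding in the column. The sketch below is one possible way to do that, assuming a small hypothetical Series `sales` that mirrors the first rows shown above (it is not read from model.csv): a value counts as a placeholder if it is present but cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the 'Sales' column
sales = pd.Series(['56', '?', '?', '35', '58', '64', '98', None, None, '-'])

# Try to parse everything; unparseable entries become NaN
parsed = pd.to_numeric(sales, errors='coerce')

# Keep the entries that exist in the raw data but failed to parse
placeholders = sales[sales.notna() & parsed.isna()]

# Count each distinct placeholder string
placeholder_counts = placeholders.value_counts()
print(placeholder_counts)
```

Running the same check on the real column would show every non-numeric marker, not only the '?' and '-' visible in the preview.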


## First way

### Convert the 'Sales' column from object to a numeric type


In [None]:
df['Sales'] = pd.to_numeric(df['Sales'],errors = 'coerce')

Now the 'Sales' column has type float64.

Notice that the number of NaN values is now 8 instead of 3, because errors='coerce' turned every value that could not be parsed as a number (such as '?' and '-') into NaN.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Data    40 non-null     object 
 1   Sales   32 non-null     float64
dtypes: float64(1), object(1)
memory usage: 768.0+ bytes


In [None]:
df

Unnamed: 0,Data,Sales
0,01/mai,56.0
1,01/mai,
2,01/mai,
3,02/mai,35.0
4,03/mai,58.0
5,04/mai,64.0
6,04/mai,98.0
7,04/mai,
8,04/mai,
9,04/mai,
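
To make the effect of errors='coerce' concrete, here is a minimal sketch on a hypothetical five-element Series (`raw` is an illustrative name, not part of the notebook's data). With the default errors='raise', pd.to_numeric stops at the first unparseable string; with 'coerce' it replaces such values with NaN.

```python
import pandas as pd

# Hypothetical sample: two numbers, two placeholders, one real missing value
raw = pd.Series(['56', '?', '35', None, '-'])

# Default behaviour (errors='raise') fails on the first unparseable string
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as exc:
    raised = True
    print('errors="raise" failed:', exc)

# errors='coerce' converts what it can and turns the rest into NaN
converted = pd.to_numeric(raw, errors='coerce')
print(converted.dtype)
print(converted.isna().sum())
```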


In [None]:
# Drop every row (axis=0) that contains at least one NaN, modifying df in place
df.dropna(how='any', axis=0, inplace=True)

In [None]:
df.shape

(32, 2)
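
A possible variation, sketched on a hypothetical three-row frame named `demo`: instead of inplace=True, dropna can return a new dataframe, and the subset parameter restricts the check to specific columns. This keeps the original data available for the fill strategies discussed later.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing sale
demo = pd.DataFrame({
    'Data': ['01/mai', '01/mai', '02/mai'],
    'Sales': [56.0, np.nan, 35.0],
})

# Drop rows where 'Sales' is NaN; demo itself is left untouched
clean = demo.dropna(subset=['Sales']).reset_index(drop=True)
print(clean.shape)
```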

## Second way to remove the missing data

In [None]:
df1 = pd.read_csv('model.csv',sep=';')

In [None]:
# Because the column is still text (object), we use replace() to turn the placeholders into NaN.

df1 = df1.replace({"?":np.nan,"-":np.nan})

In [None]:
df1.dropna(how='any', axis=0, inplace=True)

In [None]:
df1.shape

(32, 2)
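
A third option, not used in the notebook, is to declare the placeholders when the file is read, through the na_values parameter of pd.read_csv. The sketch below assumes a small in-memory CSV (`csv_text`, a hypothetical stand-in for model.csv) so that it runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for model.csv, using the same ';' separator
csv_text = "Data;Sales\n01/mai;56\n01/mai;?\n02/mai;35\n04/mai;\n04/mai;-\n"

# Treat '?' and '-' as missing in addition to pandas' default NaN markers
df_na = pd.read_csv(io.StringIO(csv_text), sep=';', na_values=['?', '-'])

print(df_na['Sales'].dtype)
print(df_na['Sales'].isna().sum())
```

With this approach the column is parsed as numeric directly, so no separate replace() or pd.to_numeric step is needed.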

# Filling the missing data with values

### Mean

In [None]:
# df is the cleaned dataframe from the first way, so the mean uses only valid sales
mean_df = round(df['Sales'].mean(), 0)
mean_df

193.0

In [None]:
df2 = pd.read_csv('model.csv',sep=';')

In [None]:
df2 = df2.replace({"?":np.nan,"-":np.nan})

In [None]:
df2['Sales'] = df2['Sales'].fillna(mean_df)
df2

Unnamed: 0,Data,Sales
0,01/mai,56
1,01/mai,193
2,01/mai,193
3,02/mai,35
4,03/mai,58
5,04/mai,64
6,04/mai,98
7,04/mai,193
8,04/mai,193
9,04/mai,193
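
Note that replace() alone leaves the remaining sales as text, so the filled column mixes strings with the numeric mean. A minimal sketch of one way around this, assuming a hypothetical five-row frame `demo`, is to convert the column to numeric first and then fill:

```python
import pandas as pd

# Hypothetical sample with text values, placeholders, and a real missing value
demo = pd.DataFrame({'Sales': ['56', '?', '36', None, '-']})

# Convert to numbers; placeholders and missing entries become NaN
demo['Sales'] = pd.to_numeric(demo['Sales'], errors='coerce')

# mean() skips NaN by default, so only the valid sales are averaged
fill_value = round(demo['Sales'].mean(), 0)

# Replace every NaN with the rounded mean
demo['Sales'] = demo['Sales'].fillna(fill_value)
print(demo)
```

After this the column is float64 throughout, so later arithmetic or plotting does not trip over string values.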


### Or filling with the last valid value

In [None]:
df3 = pd.read_csv('model.csv',sep=';')

In [None]:
df3 = df3.replace({"?":np.nan,"-":np.nan})

In [None]:
# You can use bfill() (next valid value) or ffill() (last valid value).
# In this case I'm using ffill(); fillna(method='ffill') is deprecated in recent pandas versions.
df3['Sales'] = df3['Sales'].ffill()
df3

Unnamed: 0,Data,Sales
0,01/mai,56
1,01/mai,56
2,01/mai,56
3,02/mai,35
4,03/mai,58
5,04/mai,64
6,04/mai,98
7,04/mai,98
8,04/mai,98
9,04/mai,98
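
To compare the two directions side by side, here is a small sketch on a hypothetical Series `s`. Forward fill copies the last valid value downwards, while backward fill copies the next valid value upwards, so a gap at the end of the data stays NaN under bfill (and a gap at the start stays NaN under ffill).

```python
import numpy as np
import pandas as pd

# Hypothetical sales with gaps in the middle and at the end
s = pd.Series([56.0, np.nan, np.nan, 35.0, np.nan])

forward = s.ffill()    # propagate the last valid value forward
backward = s.bfill()   # propagate the next valid value backward

print(pd.DataFrame({'original': s, 'ffill': forward, 'bfill': backward}))
```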
