# Initial Data Cleaning 

Often your dataset has to be cleaned before you can do any analysis on it.  Some common data cleaning activities are included below.


In [1]:
import pandas as pd
datafile = "C:/Users/Nancy/Documents/Python Scripts/weather.csv"
# Use the third column (DATE) as the index and parse it as dates
df = pd.read_csv(datafile, index_col=2, parse_dates=True)
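
To try the same loading step without the original file, here is a minimal sketch that reads a small in-memory CSV. The station ID, station name, and values are hypothetical stand-ins, not rows from `weather.csv`; only the column layout (DATE in the third position) is assumed to match.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for weather.csv (made-up values).
sample_csv = io.StringIO(
    "STATION,NAME,DATE,PRCP\n"
    'US0000001,"EXAMPLE TOWN, NJ US",2022-01-01,0.17\n'
    'US0000001,"EXAMPLE TOWN, NJ US",2022-01-02,\n'
    'US0000001,"EXAMPLE TOWN, NJ US",2022-01-03,0.02\n'
)

# index_col=2 makes the third column (DATE) the index;
# parse_dates=True asks pandas to parse that index as dates.
sample = pd.read_csv(sample_csv, index_col=2, parse_dates=True)

print(sample.index.name)             # name of the index column
print(sample.index.dtype)            # a datetime dtype if parsing succeeded
print(sample["PRCP"].isna().sum())   # the empty PRCP field is read as NaN
```

Passing the column name instead (`index_col="DATE"`) should behave the same way and is less fragile if the column order ever changes.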

### Column Names

Sometimes the dataset has more columns than you need. Drop the ones you won't use.

In [7]:
df.columns

Index(['STATION', 'NAME', 'DAPR', 'MDPR', 'PRCP', 'SNOW', 'SNWD', 'TAVG',
       'TMAX', 'TMIN', 'TOBS', 'WT01', 'WT02', 'WT06', 'WT08', 'WT09'],
      dtype='object')

In [10]:
# drop() returns a new DataFrame; df itself is unchanged unless you assign the result
df.drop(columns=['WT01', 'WT02', 'WT06', 'WT08', 'WT09'])

DATE,STATION,NAME,DAPR,MDPR,RAIN,SNOW,SNWD,TAVG,TMAX,TMIN,TOBS
2022-01-01,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.17,,,,,,
2022-01-02,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.36,,,,,,
2022-01-03,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.02,0.0,0.0,,,,
2022-01-04,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.84,8.0,8.0,,,,
2022-01-05,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.02,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
2022-01-24,US1NJOC0090,"PINE BEACH 0.4 NW, NJ US",,,0.00,0.0,0.0,,,,
2022-01-25,US1NJOC0090,"PINE BEACH 0.4 NW, NJ US",,,0.00,0.0,0.0,,,,
2022-01-26,US1NJOC0090,"PINE BEACH 0.4 NW, NJ US",,,0.00,0.0,0.0,,,,
2022-01-27,US1NJOC0090,"PINE BEACH 0.4 NW, NJ US",,,0.00,0.0,0.0,,,,
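
Because `drop` returns a new DataFrame, the columns are only removed for good if the result is assigned. The sketch below shows this on a hypothetical two-row frame (`demo` and its values are made up for illustration; only the column names mirror the weather data).

```python
import pandas as pd

# Hypothetical miniature frame; values are made up.
demo = pd.DataFrame({
    "PRCP": [0.17, 0.36],
    "SNOW": [None, 0.0],
    "WT01": [None, 1.0],
})

# drop() returns a new DataFrame and leaves the original untouched.
trimmed = demo.drop(columns=["WT01"])

# errors="ignore" skips labels that are not present instead of raising KeyError,
# which makes the cell safe to re-run.
trimmed_again = trimmed.drop(columns=["WT01"], errors="ignore")

print(list(demo.columns))     # original still has WT01
print(list(trimmed.columns))  # WT01 is gone in the new frame
```

To make the change stick in the notebook, assign the result back, for example `df = df.drop(columns=[...])`.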


#### Column names are sometimes cryptic, long, or difficult to work with.  Renaming them makes working with them easier.

In [11]:
df.rename(
    columns={'PRCP': 'RAIN'},
    inplace=True,
)
df.columns

Index(['STATION', 'NAME', 'DAPR', 'MDPR', 'RAIN', 'SNOW', 'SNWD', 'TAVG',
       'TMAX', 'TMIN', 'TOBS', 'WT01', 'WT02', 'WT06', 'WT08', 'WT09'],
      dtype='object')

In [13]:
# Without inplace=True or assignment, rename() returns a renamed copy; df keeps its original names
df.rename(columns=str.lower).head()

DATE,station,name,dapr,mdpr,rain,snow,snwd,tavg,tmax,tmin,tobs,wt01,wt02,wt06,wt08,wt09
2022-01-01,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.17,,,,,,,,,,,
2022-01-02,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.36,,,,,,,,,,,
2022-01-03,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.02,0.0,0.0,,,,,,,,,
2022-01-04,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.84,8.0,8.0,,,,,,,,,
2022-01-05,US1NJOC0058,"STAFFORD TWP 2.8 NNW, NJ US",,,0.02,,,,,,,,,,,


In [14]:
df.columns

Index(['STATION', 'NAME', 'DAPR', 'MDPR', 'RAIN', 'SNOW', 'SNWD', 'TAVG',
       'TMAX', 'TMIN', 'TOBS', 'WT01', 'WT02', 'WT06', 'WT08', 'WT09'],
      dtype='object')
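
The output above shows the column names still in upper case: `rename` without `inplace=True` returns a copy. A minimal sketch of both forms of `rename` (a mapping and a function), using a hypothetical one-row frame named `demo`:

```python
import pandas as pd

# Hypothetical one-row frame; values are made up.
demo = pd.DataFrame({"PRCP": [0.17], "SNOW": [0.0]})

# A function passed as columns= is applied to every column name.
lowered = demo.rename(columns=str.lower)

# The original frame is unchanged because the result was not assigned back to it.
print(list(demo.columns))
print(list(lowered.columns))

# Assigning the result keeps the change; calls can be chained.
demo = demo.rename(columns={"PRCP": "RAIN"}).rename(columns=str.lower)
print(list(demo.columns))
```

Assignment and `inplace=True` both persist the new names; assignment has the advantage that calls can be chained as in the last step.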

## Missing data
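Datasets often contain missing values, which pandas represents as NaN (not a number). pandas provides many functions for detecting and handling them. The cells below were run before the rename above (note the cell numbers), so the precipitation column is still called PRCP.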

In [2]:
import statistics as stats

In [3]:
# statistics.mean does not skip NaN, so any missing value makes the result nan
stats.mean(df['PRCP'])

nan

In [5]:
# Treat missing precipitation readings as 0
df['PRCP'] = df['PRCP'].fillna(0)

In [6]:
stats.mean(df['PRCP'])

0.11419207796290474
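
Filling with 0 is one option, but it changes the result: every missing reading is counted as a day with no precipitation. As a hedged comparison, the sketch below uses a hypothetical four-value Series (`rain`, made-up numbers) to contrast pandas' own `mean`, which skips NaN by default, with the fill-with-zero approach.

```python
import pandas as pd

# Hypothetical precipitation readings with one missing value (made-up numbers).
rain = pd.Series([0.17, None, 0.02, 0.84], name="PRCP")

# Count the missing values before deciding how to handle them.
n_missing = int(rain.isna().sum())

# Series.mean() skips NaN by default (skipna=True): averages the 3 known values.
mean_skipping_nan = rain.mean()

# Filling with 0 keeps all 4 rows in the denominator, which lowers the mean.
mean_filled_with_zero = rain.fillna(0).mean()

# skipna=False propagates the NaN instead of skipping it.
mean_with_nan = rain.mean(skipna=False)

print(n_missing)
print(mean_skipping_nan)
print(mean_filled_with_zero)
print(mean_with_nan)
```

Which approach is appropriate depends on what a missing reading means in the data: "no measurement taken" argues for skipping, while "nothing to report" argues for filling with 0.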