In [1]:
# DSC540
# Weeks 5 & 6
# Milestone 2
# Author: Nathanael Ochoa
# 04/21/2024

# Cleaning/Formatting Flat File Source

The data being used is not mine and its source can be found [here](https://www.kaggle.com/datasets/rtatman/did-it-rain-in-seattle-19482017).

In [2]:
# Import packages
#import numpy as np
import pandas as pd

First I'm going to import the data as a data frame, check its type to confirm it's a data frame, and preview the data.

In [3]:
# Import data as a data frame
srain = pd.read_csv("seattleWeather_1948-2017.csv")

In [4]:
# Type check
type(srain)

pandas.core.frame.DataFrame

In [5]:
# Preview
print(srain)

             DATE  PRCP  TMAX  TMIN   RAIN
0      1948-01-01  0.47    51    42   True
1      1948-01-02  0.59    45    36   True
2      1948-01-03  0.42    45    35   True
3      1948-01-04  0.31    45    34   True
4      1948-01-05  0.17    45    32   True
...           ...   ...   ...   ...    ...
25546  2017-12-10  0.00    49    34  False
25547  2017-12-11  0.00    49    29  False
25548  2017-12-12  0.00    46    32  False
25549  2017-12-13  0.00    48    34  False
25550  2017-12-14  0.00    50    36  False

[25551 rows x 5 columns]


**The following is taken directly from the file's source**
* DATE - the date of the observation
* PRCP - the amount of precipitation, in inches
* TMAX - the maximum temperature for that day, in degrees Fahrenheit
* TMIN - the minimum temperature for that day, in degrees Fahrenheit
* RAIN - TRUE if rain was observed on that day, FALSE if it was not

## Step 1 - Replacing column headers

I'm renaming the columns using the rename() function. I'm choosing names that are a bit longer but make the data easier to understand. I also renamed the **DATE** column to **date** since I'm going to add columns for year, month, and day, which will make analysis based on month and day easier to query.

In [6]:
# Rename headers
srain.rename(columns = {"DATE":"date", "PRCP":"rainfall", "TMAX":"max_temp", "TMIN":"min_temp", "RAIN":"rain"}, 
             inplace = True)

In [7]:
# Preview changes
srain.head()

Unnamed: 0,date,rainfall,max_temp,min_temp,rain
0,1948-01-01,0.47,51,42,True
1,1948-01-02,0.59,45,36,True
2,1948-01-03,0.42,45,35,True
3,1948-01-04,0.31,45,34,True
4,1948-01-05,0.17,45,32,True


## Step 2 - Split the *date* column into its components and create new columns

I'd like extra columns available that only contain the year, month, and day. That way I can easily view statistics based on the separate components of the original **date** column. I may not use it in my analysis but it'll be ready just in case.

In [8]:
# Create the new columns using str.split()
srain[["date_year", "date_month", "date_day"]] = srain["date"].str.split("-", expand = True)

In [9]:
# Preview changes
srain.head()

Unnamed: 0,date,rainfall,max_temp,min_temp,rain,date_year,date_month,date_day
0,1948-01-01,0.47,51,42,True,1948,01,01
1,1948-01-02,0.59,45,36,True,1948,01,02
2,1948-01-03,0.42,45,35,True,1948,01,03
3,1948-01-04,0.31,45,34,True,1948,01,04
4,1948-01-05,0.17,45,32,True,1948,01,05


In [10]:
# An example of using the components
srain.query("date_month == '06' and date_year == '2005'")

Unnamed: 0,date,rainfall,max_temp,min_temp,rain,date_year,date_month,date_day
20971,2005-06-01,0.03,66,51,True,2005,06,01
20972,2005-06-02,0.0,65,52,False,2005,06,02
20973,2005-06-03,0.0,61,50,False,2005,06,03
20974,2005-06-04,0.0,64,50,False,2005,06,04
20975,2005-06-05,0.01,62,50,True,2005,06,05
20976,2005-06-06,0.0,61,49,False,2005,06,06
20977,2005-06-07,0.1,60,51,True,2005,06,07
20978,2005-06-08,0.03,62,50,True,2005,06,08
20979,2005-06-09,0.0,69,52,False,2005,06,09
20980,2005-06-10,0.0,65,50,False,2005,06,10
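
As a possible alternative to splitting the string, the same components can be pulled out by parsing the dates first. This is only a minimal sketch on a small stand-in data frame (`toy`, `parsed`, and `june_2005` are names made up for the example, not part of this notebook), and it produces integer components rather than the zero-padded strings that str.split() returns.

```python
import pandas as pd

# Small stand-in frame; the dates are copied from the previews above
toy = pd.DataFrame({"date": ["1948-01-01", "2005-06-01", "2017-12-14"]})

# Parse the yyyy-mm-dd strings into real datetime values
parsed = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# The .dt accessor exposes each component as an integer column
toy["date_year"] = parsed.dt.year
toy["date_month"] = parsed.dt.month
toy["date_day"] = parsed.dt.day

# With integer components the query no longer needs quoted, zero-padded values
june_2005 = toy.query("date_month == 6 and date_year == 2005")
print(june_2005)
```

The trade-off is that integer components lose the leading zeros, so they can't be concatenated back into an mm-dd-yyyy string without extra formatting.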


## Step 3 - Reformat the *date* column

I'd like the **date** column to be in the format mm-dd-yyyy, which I find a lot more natural to read. To make this change, I'll concatenate the columns created in step 2.

In [11]:
# Variable containing the concatenated 'date_day' and 'date_year'
dandy = srain["date_day"] + "-" + srain["date_year"]

In [12]:
# Now concatenate the 'date_month' column and overwrite the 'date' column
srain["date"] = srain["date_month"].str.cat(dandy, sep = "-")

In [13]:
# Preview changes
srain.head()

Unnamed: 0,date,rainfall,max_temp,min_temp,rain,date_year,date_month,date_day
0,01-01-1948,0.47,51,42,True,1948,01,01
1,01-02-1948,0.59,45,36,True,1948,01,02
2,01-03-1948,0.42,45,35,True,1948,01,03
3,01-04-1948,0.31,45,34,True,1948,01,04
4,01-05-1948,0.17,45,32,True,1948,01,05
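
The same mm-dd-yyyy format could also be produced without the helper columns by parsing the original strings and formatting them back out. The sketch below assumes the column still holds the original yyyy-mm-dd strings and uses a stand-in data frame (`toy` and `parsed` are hypothetical names for the example).

```python
import pandas as pd

# Stand-in frame with dates in the original yyyy-mm-dd layout
toy = pd.DataFrame({"date": ["1948-01-01", "1948-01-02", "2017-12-14"]})

# Parse to datetimes, then write them back out as mm-dd-yyyy strings
parsed = pd.to_datetime(toy["date"], format="%Y-%m-%d")
toy["date"] = parsed.dt.strftime("%m-%d-%Y")

print(toy["date"].tolist())
# ['01-01-1948', '01-02-1948', '12-14-2017']
```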


## Step 4 - Check for any duplicates in the *date* column

The only column that shouldn't contain duplicates is the **date** column. Every other column should have duplicates; that's perfectly normal for this data.

In [14]:
# Check for duplicates
print("The 'date' column contains duplicates - {}".format(any(srain.date.duplicated())))

The 'date' column contains duplicates - False
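
Had the check come back True, the next step would be to look at the offending rows. Here is a minimal sketch of how that might look, using a stand-in data frame with a deliberately repeated date (the repeated row is invented for the example and is not in the Seattle data).

```python
import pandas as pd

# Stand-in frame; the last row repeats a date on purpose for illustration
toy = pd.DataFrame({
    "date": ["01-01-1948", "01-02-1948", "01-02-1948"],
    "rainfall": [0.47, 0.59, 0.59],
})

# True if any date appears more than once
has_duplicates = toy["date"].duplicated().any()

# keep=False marks every copy, so all rows sharing a date are returned
repeated_rows = toy[toy["date"].duplicated(keep=False)]

print(has_duplicates)
print(repeated_rows)
```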


## Step 5 - Check for any NaN values in the data

It's not uncommon for datasets to come with NaN values or errors so it's always a good idea to check for them before starting the analysis.

In [15]:
# Check for NaN values
print("The 'date' column contains NaN - %s" % srain.date.isnull().values.any())
print("The 'rainfall' column contains NaN - %s" % srain.rainfall.isnull().values.any())
print("The 'max_temp' column contains NaN - %s" % srain.max_temp.isnull().values.any())
print("The 'min_temp' column contains NaN - %s" % srain.min_temp.isnull().values.any())
print("The 'rain' column contains NaN - %s" % srain.rain.isnull().values.any())
print("The 'date_year' column contains NaN - %s" % srain.date_year.isnull().values.any())
print("The 'date_month' column contains NaN - %s" % srain.date_month.isnull().values.any())
print("The 'date_day' column contains NaN - %s" % srain.date_day.isnull().values.any())

The 'date' column contains NaN - False
The 'rainfall' column contains NaN - True
The 'max_temp' column contains NaN - False
The 'min_temp' column contains NaN - False
The 'rain' column contains NaN - True
The 'date_year' column contains NaN - False
The 'date_month' column contains NaN - False
The 'date_day' column contains NaN - False


The **rainfall** and **rain** columns contain NaN values. I'll count how many for comparison.

In [16]:
# Count the NaN values in 'rainfall'
srain["rainfall"].isnull().sum()

3

In [17]:
# Count the NaN values in 'rain'
srain["rain"].isnull().sum()

3
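
The per-column checks and counts above can also be done in one pass. This is a sketch on a stand-in data frame (`toy`, `nan_counts`, and `columns_with_nan` are names made up here); the three rows are taken from the data shown in this notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame: one complete row and two rows with missing observations
toy = pd.DataFrame({
    "date": ["01-01-1948", "06-02-1998", "06-03-1998"],
    "rainfall": [0.47, np.nan, np.nan],
    "max_temp": [51, 72, 66],
    "rain": [True, np.nan, np.nan],
})

# Count the NaN values in every column at once
nan_counts = toy.isna().sum()

# Keep only the columns that actually contain NaN
columns_with_nan = nan_counts[nan_counts > 0]

print(columns_with_nan)
```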

In [18]:
# Query NaN rows in the data
srain[srain["rainfall"].isnull()]

Unnamed: 0,date,rainfall,max_temp,min_temp,rain,date_year,date_month,date_day
18415,06-02-1998,,72,52,,1998,06,02
18416,06-03-1998,,66,51,,1998,06,03
21067,09-05-2005,,70,52,,2005,09,05


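The NaN values in the **rain** column fall in the same rows as the NaN values in the **rainfall** column. This makes sense, since **rain** records whether any rain was observed that day. Each column has 3 NaN values and the query returned only 3 rows, so the two columns overlap completely and there are no extra hidden NaN values in the data.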
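
The overlap can also be confirmed directly rather than by eye. A minimal sketch, again on a stand-in data frame (`toy` and `same_rows` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same missing-value pattern as the real data
toy = pd.DataFrame({
    "date": ["01-01-1948", "06-02-1998", "06-03-1998"],
    "rainfall": [0.47, np.nan, np.nan],
    "rain": [True, np.nan, np.nan],
})

# Compare the two NaN masks row by row; True only if they match everywhere
same_rows = (toy["rainfall"].isna() == toy["rain"].isna()).all()

print(same_rows)
```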

I'm currently unsure what to do about the empty values. I could predict them from previous data, which would be very interesting to do, but I'd have no actual values to compare the predictions against. I managed to find the source of the data [here](https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00024233/detail), and those 3 dates have no reported observations. I may end up dropping the rows altogether or making them part of the analysis.
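
For reference, the two options I'm weighing might look something like the sketch below. It runs on a stand-in data frame and nothing here is applied to the real data; the `missing_obs` column name is made up for the example.

```python
import numpy as np
import pandas as pd

# Stand-in frame: one complete row and two rows with missing observations
toy = pd.DataFrame({
    "date": ["01-01-1948", "06-02-1998", "06-03-1998"],
    "rainfall": [0.47, np.nan, np.nan],
    "rain": [True, np.nan, np.nan],
})

# Option 1: drop the rows that have no precipitation observation
dropped = toy.dropna(subset=["rainfall", "rain"])

# Option 2: keep every row and flag the ones with missing observations
flagged = toy.assign(missing_obs=toy["rainfall"].isna())

print(len(dropped))
print(flagged["missing_obs"].tolist())
```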

## Step 6 - Check column data types

I want to check the data types of the columns and change any if need be. The **date_year**, **date_month**, and **date_day** columns, for example, hold integer values stored as strings, so I can either change their data types or leave them as is.

In [19]:
# Check column data types
print("'date' data type - {}".format(srain["date"].dtype))
print("'rainfall' data type - {}".format(srain["rainfall"].dtype))
print("'max_temp' data type - {}".format(srain["max_temp"].dtype))
print("'min_temp' data type - {}".format(srain["min_temp"].dtype))
print("'rain' data type - {}".format(srain["rain"].dtype))
print("'date_year' data type - {}".format(srain["date_year"].dtype))
print("'date_month' data type - {}".format(srain["date_month"].dtype))
print("'date_day' data type - {}".format(srain["date_day"].dtype))

'date' data type - object
'rainfall' data type - float64
'max_temp' data type - int64
'min_temp' data type - int64
'rain' data type - object
'date_year' data type - object
'date_month' data type - object
'date_day' data type - object


I'm actually not going to convert the **date_year**, **date_month**, and **date_day** columns into integer types since they represent the date. The numerical columns are of the correct type and there is no problem with the other columns, so I'm going to leave everything as is.
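
If the conversion is ever needed later, it could look roughly like this. The sketch uses a stand-in data frame (`toy`, `as_int`, and `rain_nullable` are hypothetical names) and assumes the **rain** column shows up as object because it mixes True/False with missing values, which pandas' nullable "boolean" type can hold.

```python
import pandas as pd

# Stand-in frame mirroring the string date parts and a rain column with a gap
toy = pd.DataFrame({
    "date_year": ["1948", "2005"],
    "date_month": ["01", "06"],
    "date_day": ["01", "02"],
    "rain": [True, None],
})

# All data types in one call instead of one print per column
print(toy.dtypes)

# Convert the zero-padded strings to integers (the leading zeros are lost)
as_int = toy[["date_year", "date_month", "date_day"]].astype(int)

# Nullable boolean type keeps True/False alongside missing values
rain_nullable = toy["rain"].astype("boolean")

print(as_int)
print(rain_nullable)
```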

## Step 7 - Rearranging column order

I want the columns to be in an easy to read order. Reading the headers left to right should make sense. I want the new order to be: **date**, **date_month**, **date_day**, **date_year**, **min_temp**, **max_temp**, **rain**, **rainfall**.

In [20]:
# Overwrite the data with the new column order
srain = srain[["date", "date_month", "date_day", "date_year", "min_temp", "max_temp", "rain", "rainfall"]]

In [21]:
# View final dataset
srain

Unnamed: 0,date,date_month,date_day,date_year,min_temp,max_temp,rain,rainfall
0,01-01-1948,01,01,1948,42,51,True,0.47
1,01-02-1948,01,02,1948,36,45,True,0.59
2,01-03-1948,01,03,1948,35,45,True,0.42
3,01-04-1948,01,04,1948,34,45,True,0.31
4,01-05-1948,01,05,1948,32,45,True,0.17
...,...,...,...,...,...,...,...,...
25546,12-10-2017,12,10,2017,34,49,False,0.00
25547,12-11-2017,12,11,2017,29,49,False,0.00
25548,12-12-2017,12,12,2017,32,46,False,0.00
25549,12-13-2017,12,13,2017,34,48,False,0.00


There shouldn't be any ethical implications from the data transformations I made to the flat file data. I replaced the column headers, created new columns that contain the components of the **date** column (i.e. month, day, year), reformatted the date to be mm-dd-yyyy, checked for any duplicates in the **date** column (the only column where duplicates would've been problematic), checked for any empty values in the data, checked the column data types, and rearranged the column order in the data.

I have not claimed any of the data to be mine and have included the source of the data at the top of this file. The person who put the data together on [Kaggle.com](https://www.kaggle.com/) also included the following disclaimer: "This dataset was compiled by NOAA and is in the public domain". Therefore the data was collected in an ethical way and there are no current legal implications.

In [23]:
# Save as CSV to use in Milestone 5
srain.to_csv("M5_flat_file.csv")
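
One thing to keep in mind for Milestone 5: reading a CSV back in re-infers the data types, so the zero-padded strings can come back as integers, and a saved index comes back as an extra unnamed column. The sketch below shows one way to avoid both on a stand-in data frame. It writes to an in-memory buffer instead of a file, and `index=False` and the `dtype` mapping are my suggestions rather than what the cell above does.

```python
import io

import pandas as pd

# Stand-in frame with the zero-padded string date parts
toy = pd.DataFrame({
    "date": ["01-01-1948"],
    "date_month": ["01"],
    "date_day": ["01"],
    "date_year": ["1948"],
})

# Write without the index so no extra unnamed column appears on reload
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Force the date parts to be read as strings so the leading zeros survive
reloaded = pd.read_csv(
    buffer,
    dtype={"date_month": str, "date_day": str, "date_year": str},
)

print(reloaded)
```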