## First Sprint (10/27 deadline):
Issues:
- Find Data and Note Source     -done
- Read in Data                  -done
- Convert Data to a Usable Format  -done
- Build Ways to Handle Errors   -done
- Get General Info from Data    -done
## Second Sprint (11/19 deadline)
Issues:
- Deal with Nulls     -done
- Clean Up Code       -done
- Feature Engineering
- Clean Your Dataset

## Issue: Find Data and Note Source
Data set #1: 2024 Louisville Daily Max and Min Temperatures

Data: 4147807.csv

Source: https://www.ncdc.noaa.gov/cdo-web/search

Data set #2: 2024 Louisville Daily Ozone Readings

Data: ad_viz_plotval_data.csv

Source: https://www.epa.gov/outdoor-air-quality-data/download-daily-data

## Issue: Read in Data

In [26]:
import pandas as pd 

Reading in temperature data set:

In [27]:
temp_df = pd.read_csv("4147807.csv")
temp_df.head()

Unnamed: 0,STATION,NAME,DATE,PRCP,SNOW,SNWD,TMAX,TMIN,TOBS
0,USC00154958,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-01,0.0,0.0,0.0,37,31,35
1,USC00154958,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-02,0.0,0.0,0.0,40,29,29
2,USC00154958,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-03,0.0,0.0,0.0,42,23,35
3,USC00154958,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-04,0.0,0.0,0.0,41,27,27
4,USC00154958,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-05,0.0,0.0,0.0,44,22,39


Reading in ozone data set:

In [28]:
ozone_df = pd.read_csv("ad_viz_plotval_data.csv")
ozone_df.head()

Unnamed: 0,Date,Source,Site ID,POC,Daily Max 8-hour Ozone Concentration,Units,Daily AQI Value,Local Site Name,Daily Obs Count,Percent Complete,...,AQS Parameter Description,Method Code,CBSA Code,CBSA Name,State FIPS Code,State,County FIPS Code,County,Site Latitude,Site Longitude
0,01/01/2024,AQS,180190008,1,0.018,ppm,17,Charlestown State Park- 1051.8 meters East of ...,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,19,Clark,38.393822,-85.664118
1,01/02/2024,AQS,180190008,1,0.022,ppm,20,Charlestown State Park- 1051.8 meters East of ...,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,19,Clark,38.393822,-85.664118
2,01/03/2024,AQS,180190008,1,0.024,ppm,22,Charlestown State Park- 1051.8 meters East of ...,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,19,Clark,38.393822,-85.664118
3,01/04/2024,AQS,180190008,1,0.025,ppm,23,Charlestown State Park- 1051.8 meters East of ...,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,19,Clark,38.393822,-85.664118
4,01/05/2024,AQS,180190008,1,0.024,ppm,22,Charlestown State Park- 1051.8 meters East of ...,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,19,Clark,38.393822,-85.664118
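
One way the "handle errors" issue could be covered at the reading step is a small loader that fails with a clear message. This is only a sketch: `load_csv` and its `required_columns` parameter are hypothetical names and are not part of the cells above.

```python
from pathlib import Path

import pandas as pd


def load_csv(path, required_columns=()):
    """Read a CSV file, raising a clear error if the file or a column is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Data file not found: {file_path}")

    df = pd.read_csv(file_path)

    # Report every expected column that is absent, rather than failing later
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"{file_path.name} is missing columns: {missing}")
    return df
```

With this helper, the temperature file could be read as `load_csv("4147807.csv", required_columns=["DATE", "TMAX", "TMIN"])`.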


## Issues: Convert Data to a Usable Format, and Build Ways to Handle Errors:

### Temperature data:

Checking for null values:

In [29]:
temp_df.isnull().sum()


STATION    0
NAME       0
DATE       0
PRCP       0
SNOW       0
SNWD       0
TMAX       0
TMIN       0
TOBS       0
dtype: int64

In [30]:
temp_df.isnull().values.any()

np.False_

No nulls in temperature data.

Converting date from object to datetime:

In [31]:
temp_df['DATE'] = pd.to_datetime(temp_df['DATE'])


In [32]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   STATION  366 non-null    object        
 1   NAME     366 non-null    object        
 2   DATE     366 non-null    datetime64[ns]
 3   PRCP     366 non-null    float64       
 4   SNOW     366 non-null    float64       
 5   SNWD     366 non-null    float64       
 6   TMAX     366 non-null    int64         
 7   TMIN     366 non-null    int64         
 8   TOBS     366 non-null    int64         
dtypes: datetime64[ns](1), float64(3), int64(3), object(2)
memory usage: 25.9+ KB


Renaming columns, and removing unneeded columns:

In [33]:
temp_df.columns

Index(['STATION', 'NAME', 'DATE', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN',
       'TOBS'],
      dtype='object')

In [34]:
temp_df = temp_df.rename(columns = {'TMAX': 'Max_Temp', 'TMIN': 'Min_Temp', 'NAME': 'Station_Name', 'DATE': 'Date'})


In [35]:
temp_df.drop(['STATION', 'PRCP', 'SNOW', 'SNWD', 'TOBS'], axis=1, inplace=True)

In [36]:
temp_df.head()

Unnamed: 0,Station_Name,Date,Max_Temp,Min_Temp
0,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-01,37,31
1,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-02,40,29
2,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-03,42,23
3,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-04,41,27
4,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-05,44,22


Checking for duplicated rows (even though it looks like one row for each day, with 366 rows for the 2024 leap year):

In [37]:
temp_df[temp_df['Date'].duplicated()]

Unnamed: 0,Station_Name,Date,Max_Temp,Min_Temp


### Ozone data:

Checking for null values:

In [38]:
ozone_df.isnull().sum()

Date                                      0
Source                                    0
Site ID                                   0
POC                                       0
Daily Max 8-hour Ozone Concentration      0
Units                                     0
Daily AQI Value                           0
Local Site Name                         341
Daily Obs Count                           0
Percent Complete                          0
AQS Parameter Code                        0
AQS Parameter Description                 0
Method Code                               0
CBSA Code                                 0
CBSA Name                                 0
State FIPS Code                           0
State                                     0
County FIPS Code                          0
County                                    0
Site Latitude                             0
Site Longitude                            0
dtype: int64

In [39]:
ozone_df[ozone_df['Local Site Name'].isnull()]

Unnamed: 0,Date,Source,Site ID,POC,Daily Max 8-hour Ozone Concentration,Units,Daily AQI Value,Local Site Name,Daily Obs Count,Percent Complete,...,AQS Parameter Description,Method Code,CBSA Code,CBSA Name,State FIPS Code,State,County FIPS Code,County,Site Latitude,Site Longitude
344,01/01/2024,AQS,180430008,1,0.017,ppm,16,,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322
345,01/03/2024,AQS,180430008,1,0.024,ppm,22,,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322
346,01/04/2024,AQS,180430008,1,0.022,ppm,20,,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322
347,01/05/2024,AQS,180430008,1,0.016,ppm,15,,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322
348,01/06/2024,AQS,180430008,1,0.019,ppm,18,,17,100.0,...,Ozone,47,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
680,12/26/2024,AQS,180430008,1,0.022,ppm,20,,17,100.0,...,Ozone,87,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322
681,12/27/2024,AQS,180430008,1,0.025,ppm,23,,17,100.0,...,Ozone,87,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322
682,12/28/2024,AQS,180430008,1,0.018,ppm,17,,17,100.0,...,Ozone,87,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322
683,12/29/2024,AQS,180430008,1,0.041,ppm,38,,17,100.0,...,Ozone,87,31140,"Louisville/Jefferson County, KY-IN",18,Indiana,43,Floyd,38.317813,-85.833322


### (Issue: Deal with Nulls):

Filling in the nulls with 'Unknown', as they all come from the same latitude and longitude. This site is also in Floyd County, not Jefferson, and I will be removing the data from sites that are not in Jefferson County, as I am focusing on Louisville Metro data.

In [40]:
ozone_df['Local Site Name'] = ozone_df['Local Site Name'].fillna('Unknown')
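
A quick check that the missing names really do come from a single site could look like the sketch below, run on the raw frame before the fill. The `sample` frame is a made-up stand-in with illustrative values, not the real ozone data.

```python
import pandas as pd

# Small stand-in for the raw ozone data; values are illustrative only
sample = pd.DataFrame({
    "Site ID": [180190008, 180430008, 180430008],
    "Local Site Name": ["Charlestown State Park", None, None],
    "County": ["Clark", "Floyd", "Floyd"],
})

# Which site/county combinations have a missing site name?
null_sites = (
    sample.loc[sample["Local Site Name"].isnull(), ["Site ID", "County"]]
    .drop_duplicates()
)
print(null_sites)
```

If this returns a single row, every null belongs to one monitoring site, which supports filling with one placeholder value.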

In [41]:
ozone_df.isnull().sum()

Date                                    0
Source                                  0
Site ID                                 0
POC                                     0
Daily Max 8-hour Ozone Concentration    0
Units                                   0
Daily AQI Value                         0
Local Site Name                         0
Daily Obs Count                         0
Percent Complete                        0
AQS Parameter Code                      0
AQS Parameter Description               0
Method Code                             0
CBSA Code                               0
CBSA Name                               0
State FIPS Code                         0
State                                   0
County FIPS Code                        0
County                                  0
Site Latitude                           0
Site Longitude                          0
dtype: int64

Converting 'Date' column from object to datetime:

In [42]:
ozone_df['Date'] = pd.to_datetime(ozone_df['Date'])
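
Because the ozone dates are written as month/day/year, an explicit format plus `errors="coerce"` is one possible way to guard the conversion. The sketch below uses a made-up `raw_dates` series with a deliberately bad value; it is not taken from the data.

```python
import pandas as pd

# Illustrative input, including one value that cannot be parsed
raw_dates = pd.Series(["01/01/2024", "12/30/2024", "not a date"])

# An explicit format avoids day/month ambiguity; errors="coerce" turns
# unparseable entries into NaT instead of raising an exception
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Keep the original text of any rows that failed, so they can be inspected
bad_rows = raw_dates[parsed.isna()]
print(f"{len(bad_rows)} value(s) could not be parsed")
```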

In [43]:
ozone_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2243 entries, 0 to 2242
Data columns (total 21 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   Date                                  2243 non-null   datetime64[ns]
 1   Source                                2243 non-null   object        
 2   Site ID                               2243 non-null   int64         
 3   POC                                   2243 non-null   int64         
 4   Daily Max 8-hour Ozone Concentration  2243 non-null   float64       
 5   Units                                 2243 non-null   object        
 6   Daily AQI Value                       2243 non-null   int64         
 7   Local Site Name                       2243 non-null   object        
 8   Daily Obs Count                       2243 non-null   int64         
 9   Percent Complete                      2243 non-null   float64       
 10  

Removing columns and renaming columns:

In [44]:
ozone_df.columns = ozone_df.columns.str.replace(' ', '_')

In [45]:
ozone_df.rename(columns = {'Daily_Max_8-hour_Ozone_Concentration': 'Max_Concentration', 'Daily_AQI_Value': 'AQI_Value', 'AQS_Parameter_Description': 'Substance_Measured'}, inplace=True)

In [46]:
ozone_df.drop(['Daily_Obs_Count', 'Units', 'Source', 'POC', 'Method_Code', 'CBSA_Code', 'CBSA_Name', 'State_FIPS_Code', 'State', 'County_FIPS_Code', 'AQS_Parameter_Code', 'Percent_Complete', 'Site_ID', 'Site_Latitude', 'Site_Longitude'], axis=1, inplace=True)

In [47]:
ozone_df.head()

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County
0,2024-01-01,0.018,17,Charlestown State Park- 1051.8 meters East of ...,Ozone,Clark
1,2024-01-02,0.022,20,Charlestown State Park- 1051.8 meters East of ...,Ozone,Clark
2,2024-01-03,0.024,22,Charlestown State Park- 1051.8 meters East of ...,Ozone,Clark
3,2024-01-04,0.025,23,Charlestown State Park- 1051.8 meters East of ...,Ozone,Clark
4,2024-01-05,0.024,22,Charlestown State Park- 1051.8 meters East of ...,Ozone,Clark


Not all sites are in Louisville Metro, so removing counties that are not Jefferson:

In [48]:
ozone_df['County'].value_counts()

County
Jefferson    1077
Clark         344
Floyd         341
Bullitt       241
Oldham        240
Name: count, dtype: int64

In [49]:
ozone_df = ozone_df[ozone_df['County'] == 'Jefferson'].copy()  # copy so new columns added later do not trigger SettingWithCopyWarning

In [50]:
ozone_df

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County
926,2024-03-01,0.023,21,Watson Lane,Ozone,Jefferson
927,2024-03-02,0.024,22,Watson Lane,Ozone,Jefferson
928,2024-03-03,0.034,31,Watson Lane,Ozone,Jefferson
929,2024-03-04,0.032,30,Watson Lane,Ozone,Jefferson
930,2024-03-05,0.029,27,Watson Lane,Ozone,Jefferson
...,...,...,...,...,...,...
1998,2024-10-26,0.039,36,Algonquin Parkway,Ozone,Jefferson
1999,2024-10-27,0.040,37,Algonquin Parkway,Ozone,Jefferson
2000,2024-10-28,0.036,33,Algonquin Parkway,Ozone,Jefferson
2001,2024-10-29,0.046,43,Algonquin Parkway,Ozone,Jefferson


Checking datatypes:

In [51]:
ozone_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 926 to 2002
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Date                1077 non-null   datetime64[ns]
 1   Max_Concentration   1077 non-null   float64       
 2   AQI_Value           1077 non-null   int64         
 3   Local_Site_Name     1077 non-null   object        
 4   Substance_Measured  1077 non-null   object        
 5   County              1077 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 58.9+ KB


Checking for duplicates:

In [52]:
ozone_df[ozone_df.duplicated(subset=['Date', 'Local_Site_Name'])]

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County


## Issue: Get General Info from Data:

(I asked ChatGPT "what would be considered getting general info from data?" and am using that as a template here.)

Basic structure:

In [53]:
temp_df.shape

(366, 4)

In [54]:
ozone_df.shape

(1077, 6)

In [55]:
temp_df.columns

Index(['Station_Name', 'Date', 'Max_Temp', 'Min_Temp'], dtype='object')

In [56]:
ozone_df.columns

Index(['Date', 'Max_Concentration', 'AQI_Value', 'Local_Site_Name',
       'Substance_Measured', 'County'],
      dtype='object')

In [57]:
temp_df.index

RangeIndex(start=0, stop=366, step=1)

In [58]:
ozone_df.index

Index([ 926,  927,  928,  929,  930,  931,  932,  933,  934,  935,
       ...
       1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002],
      dtype='int64', length=1077)

Data types and non-null counts:

In [59]:
temp_df.dtypes

Station_Name            object
Date            datetime64[ns]
Max_Temp                 int64
Min_Temp                 int64
dtype: object

In [60]:
ozone_df.dtypes

Date                  datetime64[ns]
Max_Concentration            float64
AQI_Value                      int64
Local_Site_Name               object
Substance_Measured            object
County                        object
dtype: object

In [61]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Station_Name  366 non-null    object        
 1   Date          366 non-null    datetime64[ns]
 2   Max_Temp      366 non-null    int64         
 3   Min_Temp      366 non-null    int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 11.6+ KB

In [62]:
ozone_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1077 entries, 926 to 2002
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Date                1077 non-null   datetime64[ns]
 1   Max_Concentration   1077 non-null   float64       
 2   AQI_Value           1077 non-null   int64         
 3   Local_Site_Name     1077 non-null   object        
 4   Substance_Measured  1077 non-null   object        
 5   County              1077 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 58.9+ KB

Quick stats for numeric columns:

In [63]:
temp_df.describe()

Unnamed: 0,Date,Max_Temp,Min_Temp
count,366,366.0,366.0
mean,2024-07-01 12:00:00.000000256,69.448087,50.016393
min,2024-01-01 00:00:00,14.0,0.0
25%,2024-04-01 06:00:00,56.25,38.0
50%,2024-07-01 12:00:00,73.0,52.0
75%,2024-09-30 18:00:00,84.75,64.0
max,2024-12-31 00:00:00,99.0,79.0
std,,17.844727,16.59542


In [64]:
ozone_df.describe()

Unnamed: 0,Date,Max_Concentration,AQI_Value
count,1077,1077.0,1077.0
mean,2024-06-30 22:42:27.075209216,0.04416,43.514392
min,2024-01-01 00:00:00,0.013,12.0
25%,2024-04-23 00:00:00,0.036,33.0
50%,2024-06-30 00:00:00,0.044,41.0
75%,2024-09-08 00:00:00,0.052,48.0
max,2024-12-30 00:00:00,0.088,156.0
std,,0.011885,17.523352


In [65]:
temp_df.describe(include='all')

Unnamed: 0,Station_Name,Date,Max_Temp,Min_Temp
count,366,366,366.0,366.0
unique,1,,,
top,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",,,
freq,366,,,
mean,,2024-07-01 12:00:00.000000256,69.448087,50.016393
min,,2024-01-01 00:00:00,14.0,0.0
25%,,2024-04-01 06:00:00,56.25,38.0
50%,,2024-07-01 12:00:00,73.0,52.0
75%,,2024-09-30 18:00:00,84.75,64.0
max,,2024-12-31 00:00:00,99.0,79.0


In [66]:
ozone_df.describe(include='all')

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County
count,1077,1077.0,1077.0,1077,1077,1077
unique,,,,4,1,1
top,,,,CANNONS LANE,Ozone,Jefferson
freq,,,,363,1077,1077
mean,2024-06-30 22:42:27.075209216,0.04416,43.514392,,,
min,2024-01-01 00:00:00,0.013,12.0,,,
25%,2024-04-23 00:00:00,0.036,33.0,,,
50%,2024-06-30 00:00:00,0.044,41.0,,,
75%,2024-09-08 00:00:00,0.052,48.0,,,
max,2024-12-30 00:00:00,0.088,156.0,,,


Check for missing data:

In [67]:
temp_df.isnull().sum()

Station_Name    0
Date            0
Max_Temp        0
Min_Temp        0
dtype: int64

In [68]:
ozone_df.isnull().sum()

Date                  0
Max_Concentration     0
AQI_Value             0
Local_Site_Name       0
Substance_Measured    0
County                0
dtype: int64

Check for duplicates:

In [69]:
temp_df.duplicated().sum()

np.int64(0)

In [70]:
ozone_df.duplicated().sum()

np.int64(0)

In [71]:
temp_df[temp_df.duplicated()]

Unnamed: 0,Station_Name,Date,Max_Temp,Min_Temp


In [72]:
ozone_df[ozone_df.duplicated()]

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County


Unique values in columns:

In [73]:
temp_df['Date'].unique()

<DatetimeArray>
['2024-01-01 00:00:00', '2024-01-02 00:00:00', '2024-01-03 00:00:00',
 '2024-01-04 00:00:00', '2024-01-05 00:00:00', '2024-01-06 00:00:00',
 '2024-01-07 00:00:00', '2024-01-08 00:00:00', '2024-01-09 00:00:00',
 '2024-01-10 00:00:00',
 ...
 '2024-12-22 00:00:00', '2024-12-23 00:00:00', '2024-12-24 00:00:00',
 '2024-12-25 00:00:00', '2024-12-26 00:00:00', '2024-12-27 00:00:00',
 '2024-12-28 00:00:00', '2024-12-29 00:00:00', '2024-12-30 00:00:00',
 '2024-12-31 00:00:00']
Length: 366, dtype: datetime64[ns]

In [74]:
#looks like missing readings from two days of the year
ozone_df['Date'].unique()

<DatetimeArray>
['2024-03-01 00:00:00', '2024-03-02 00:00:00', '2024-03-03 00:00:00',
 '2024-03-04 00:00:00', '2024-03-05 00:00:00', '2024-03-06 00:00:00',
 '2024-03-07 00:00:00', '2024-03-08 00:00:00', '2024-03-09 00:00:00',
 '2024-03-10 00:00:00',
 ...
 '2024-12-22 00:00:00', '2024-12-23 00:00:00', '2024-12-24 00:00:00',
 '2024-12-25 00:00:00', '2024-12-26 00:00:00', '2024-12-27 00:00:00',
 '2024-12-28 00:00:00', '2024-12-29 00:00:00', '2024-12-30 00:00:00',
 '2024-03-19 00:00:00']
Length: 364, dtype: datetime64[ns]
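
To see which two days have no ozone reading, the unique dates can be compared against a full calendar. The sketch below shows the idea on a short made-up range; `observed` stands in for `ozone_df['Date']`, and for the real data the range would be `"2024-01-01"` to `"2024-12-31"`.

```python
import pandas as pd

# Stand-in for ozone_df["Date"]: a short range with one gap and one repeat
observed = pd.to_datetime(
    pd.Series(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-04"])
)

# Every calendar day that should be present
expected = pd.date_range("2024-01-01", "2024-01-04", freq="D")

# Days in the calendar that never appear in the data
missing_dates = expected.difference(pd.DatetimeIndex(observed.unique()))
print(missing_dates)
```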

Viewing head of both dataframes:

In [75]:
temp_df.head()

Unnamed: 0,Station_Name,Date,Max_Temp,Min_Temp
0,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-01,37,31
1,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-02,40,29
2,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-03,42,23
3,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-04,41,27
4,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-05,44,22


In [76]:
ozone_df.head()

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County
926,2024-03-01,0.023,21,Watson Lane,Ozone,Jefferson
927,2024-03-02,0.024,22,Watson Lane,Ozone,Jefferson
928,2024-03-03,0.034,31,Watson Lane,Ozone,Jefferson
929,2024-03-04,0.032,30,Watson Lane,Ozone,Jefferson
930,2024-03-05,0.029,27,Watson Lane,Ozone,Jefferson


### Issue: Clean Up Code:
The code has been cleaned up, and markdown cells are used to organize the notebook and give it a clean look.

### Issue: Feature Engineering:

In [77]:
temp_df

Unnamed: 0,Station_Name,Date,Max_Temp,Min_Temp
0,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-01,37,31
1,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-02,40,29
2,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-03,42,23
3,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-04,41,27
4,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-05,44,22
...,...,...,...,...
361,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-12-27,55,46
362,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-12-28,58,51
363,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-12-29,60,45
364,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-12-30,53,43


In [78]:
ozone_df

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County
926,2024-03-01,0.023,21,Watson Lane,Ozone,Jefferson
927,2024-03-02,0.024,22,Watson Lane,Ozone,Jefferson
928,2024-03-03,0.034,31,Watson Lane,Ozone,Jefferson
929,2024-03-04,0.032,30,Watson Lane,Ozone,Jefferson
930,2024-03-05,0.029,27,Watson Lane,Ozone,Jefferson
...,...,...,...,...,...,...
1998,2024-10-26,0.039,36,Algonquin Parkway,Ozone,Jefferson
1999,2024-10-27,0.040,37,Algonquin Parkway,Ozone,Jefferson
2000,2024-10-28,0.036,33,Algonquin Parkway,Ozone,Jefferson
2001,2024-10-29,0.046,43,Algonquin Parkway,Ozone,Jefferson


Creating a column to describe the ozone levels:

AQI = Air Quality Index

0-50 good, 51-100 moderate, 101-150 unhealthy for sensitive groups, 151-200 unhealthy

Alerts issued when AQI is expected to reach 101 or more

In [79]:
#Creating a new column, filtering AQI Values into categories
ozone_df["Ozone_Levels"] = pd.cut(
    ozone_df["AQI_Value"],
    bins=[0, 50, 100, 150, 200],
    labels=["Good", "Moderate", "Unhealthy Sensitive", "Unhealthy"]
)
ozone_df

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County,Ozone_Levels
926,2024-03-01,0.023,21,Watson Lane,Ozone,Jefferson,Good
927,2024-03-02,0.024,22,Watson Lane,Ozone,Jefferson,Good
928,2024-03-03,0.034,31,Watson Lane,Ozone,Jefferson,Good
929,2024-03-04,0.032,30,Watson Lane,Ozone,Jefferson,Good
930,2024-03-05,0.029,27,Watson Lane,Ozone,Jefferson,Good
...,...,...,...,...,...,...,...
1998,2024-10-26,0.039,36,Algonquin Parkway,Ozone,Jefferson,Good
1999,2024-10-27,0.040,37,Algonquin Parkway,Ozone,Jefferson,Good
2000,2024-10-28,0.036,33,Algonquin Parkway,Ozone,Jefferson,Good
2001,2024-10-29,0.046,43,Algonquin Parkway,Ozone,Jefferson,Good
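
It may help to confirm how `pd.cut` treats values that sit exactly on a category boundary. The sketch below runs the same bins and labels on a made-up series of edge values. `include_lowest=True` is an addition of mine that the cell above does not use; it only matters for an AQI of exactly 0, which does not occur in this data (the minimum is 12).

```python
import pandas as pd

# Values sitting exactly on each category boundary
edge_values = pd.Series([0, 50, 51, 100, 101, 150, 151, 200])

# Bins are right-inclusive: (0, 50], (50, 100], (100, 150], (150, 200].
# include_lowest=True also places an AQI of exactly 0 in the first bin;
# without it, 0 would fall outside every bin and become NaN.
levels = pd.cut(
    edge_values,
    bins=[0, 50, 100, 150, 200],
    labels=["Good", "Moderate", "Unhealthy Sensitive", "Unhealthy"],
    include_lowest=True,
)
print(pd.DataFrame({"AQI_Value": edge_values, "Ozone_Levels": levels}))
```

An AQI above 200 would still become NaN with these bins, so further categories would need to be added if higher readings ever appear.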


Creating column Ozone_Alert that will be True when AQI level is above 100:

In [80]:
ozone_df["Ozone_Alert"] = ozone_df["AQI_Value"] > 100
ozone_df

Unnamed: 0,Date,Max_Concentration,AQI_Value,Local_Site_Name,Substance_Measured,County,Ozone_Levels,Ozone_Alert
926,2024-03-01,0.023,21,Watson Lane,Ozone,Jefferson,Good,False
927,2024-03-02,0.024,22,Watson Lane,Ozone,Jefferson,Good,False
928,2024-03-03,0.034,31,Watson Lane,Ozone,Jefferson,Good,False
929,2024-03-04,0.032,30,Watson Lane,Ozone,Jefferson,Good,False
930,2024-03-05,0.029,27,Watson Lane,Ozone,Jefferson,Good,False
...,...,...,...,...,...,...,...,...
1998,2024-10-26,0.039,36,Algonquin Parkway,Ozone,Jefferson,Good,False
1999,2024-10-27,0.040,37,Algonquin Parkway,Ozone,Jefferson,Good,False
2000,2024-10-28,0.036,33,Algonquin Parkway,Ozone,Jefferson,Good,False
2001,2024-10-29,0.046,43,Algonquin Parkway,Ozone,Jefferson,Good,False
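
Once the flag exists, it can be summarized per site or per day. The sketch below uses a small made-up frame with the same column names; the site names and AQI values are illustrative only.

```python
import pandas as pd

# Illustrative readings from two hypothetical sites over two days
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2024-06-01", "2024-06-01", "2024-06-02", "2024-06-02"]),
    "Local_Site_Name": ["Site A", "Site B", "Site A", "Site B"],
    "AQI_Value": [105, 98, 44, 120],
})
sample["Ozone_Alert"] = sample["AQI_Value"] > 100

# Number of readings above 100 at each site (True counts as 1)
alerts_per_site = sample.groupby("Local_Site_Name")["Ozone_Alert"].sum()

# Number of distinct days on which at least one site exceeded 100
alert_days = sample.loc[sample["Ozone_Alert"], "Date"].nunique()

print(alerts_per_site)
print(alert_days)
```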


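Creating column Extreme_Weather that will be True when the max temp is 95 or above, or the min temp is 35 or below:

In [81]: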
temp_df['Extreme_Weather'] = (temp_df['Max_Temp'] >= 95) | (temp_df['Min_Temp'] <= 35)
temp_df

Unnamed: 0,Station_Name,Date,Max_Temp,Min_Temp,Extreme_Weather
0,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-01,37,31,True
1,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-02,40,29,True
2,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-03,42,23,True
3,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-04,41,27,True
4,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-01-05,44,22,True
...,...,...,...,...,...
361,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-12-27,55,46,False
362,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-12-28,58,51,False
363,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-12-29,60,45,False
364,"LOUISVILLE WEATHER FORECAST OFFICE, KY US",2024-12-30,53,43,False
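
A possible next step is to count how many days per month get flagged. The sketch below applies the same thresholds to a small made-up frame; the temperatures are illustrative, not taken from the station data.

```python
import pandas as pd

# Illustrative rows: two winter days and two summer days
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-07-15", "2024-07-16"]),
    "Max_Temp": [37, 50, 96, 88],
    "Min_Temp": [31, 40, 75, 70],
})
sample["Extreme_Weather"] = (sample["Max_Temp"] >= 95) | (sample["Min_Temp"] <= 35)

# Group by calendar month (1-12) and count the True values
extreme_by_month = sample.groupby(sample["Date"].dt.month)["Extreme_Weather"].sum()
print(extreme_by_month)
```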


## Wind Chill and Heat Index
(If I can find wind speed and humidity data, I can calculate wind chill and heat index, which will show additional days when the weather is considered extreme.)

Creating functions to convert temp and wind speed into wind chill, and temp and humidity into heat index.

Formula from the National Weather Service.

If temp is above 50 or wind speed is below 3 mph, wind chill = air temperature, with no calculation needed.

In [126]:
def wind_chill(temp, wind):
    # temp in Fahrenheit, wind in mph
    # the formula only applies when temp <= 50 and wind >= 3
    if wind >= 3 and temp <= 50:
        wind_ch = 35.74 + 0.6215*temp - 35.75*(wind**0.16) + 0.4275*temp*(wind**0.16)
        # return a number (not a formatted string) so both branches return the same kind of value
        return round(wind_ch, 2)
    else:
        return temp
    
wind_chill(40, 15)

31.84
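
When wind data is available, the same formula could be applied to a whole column at once rather than one value at a time. This is a sketch only: `wind_chill_series` is a hypothetical name, and `Wind_Speed` is an assumed column that the current data does not contain.

```python
import numpy as np
import pandas as pd


def wind_chill_series(temp, wind):
    """Vectorized wind chill (deg F, mph) for whole columns at once."""
    temp = np.asarray(temp, dtype=float)
    wind = np.asarray(wind, dtype=float)
    chill = 35.74 + 0.6215 * temp - 35.75 * wind**0.16 + 0.4275 * temp * wind**0.16
    # Outside the formula's valid range, fall back to the air temperature
    return np.where((temp <= 50) & (wind >= 3), chill, temp)


# "Wind_Speed" is a hypothetical column; the values are illustrative only
sample = pd.DataFrame({"Min_Temp": [40, 60, 30], "Wind_Speed": [15, 15, 2]})
sample["Wind_Chill"] = wind_chill_series(sample["Min_Temp"], sample["Wind_Speed"]).round(2)
print(sample)
```

The second and third rows fall outside the formula's range (too warm, and too little wind), so they keep the air temperature.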

Creating a function for calculating heat index. Formula from the National Weather Service.

If temperature is below 80 or humidity is below 40%, the heat index = the air temperature, so the function returns the air temperature without calculating.

In [125]:
def heat_index(temp, humidity):
    # temp in Fahrenheit, relative humidity as a percentage (0-100)
    # the formula is only applied when temp >= 80 and relative humidity >= 40%
    if temp >= 80 and humidity >= 40:
        heat_ind = (-42.379 + 2.04901523*temp + 10.14333127*humidity - 0.22475541*temp*humidity
          - 6.83783e-3*temp**2 - 5.481717e-2*humidity**2
          + 1.22874e-3*temp**2*humidity + 8.5282e-4*temp*humidity**2
          - 1.99e-6*temp**2*humidity**2)
        
        # return a number (not a formatted string) so both branches return the same kind of value
        return round(heat_ind, 2)
    else:
        return temp
    
heat_index(88, 60)

95.15
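
The heat index formula can likewise be applied to whole columns once humidity data is found. As before, this is a sketch under assumptions: `heat_index_series` is a hypothetical name and `Humidity` is an assumed column that the current data does not contain.

```python
import numpy as np
import pandas as pd


def heat_index_series(temp, humidity):
    """Vectorized heat index (deg F, relative humidity in %) for whole columns."""
    temp = np.asarray(temp, dtype=float)
    humidity = np.asarray(humidity, dtype=float)
    heat_ind = (-42.379 + 2.04901523 * temp + 10.14333127 * humidity
                - 0.22475541 * temp * humidity
                - 6.83783e-3 * temp**2 - 5.481717e-2 * humidity**2
                + 1.22874e-3 * temp**2 * humidity + 8.5282e-4 * temp * humidity**2
                - 1.99e-6 * temp**2 * humidity**2)
    # Below 80 degrees or 40% humidity, fall back to the air temperature
    return np.where((temp >= 80) & (humidity >= 40), heat_ind, temp)


# "Humidity" is a hypothetical column; the values are illustrative only
sample = pd.DataFrame({"Max_Temp": [88, 75, 90], "Humidity": [60, 60, 30]})
sample["Heat_Index"] = heat_index_series(sample["Max_Temp"], sample["Humidity"]).round(2)
print(sample)
```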