# Import Libraries and Load P_WX2013.csv Data

## Imports

In [51]:
# add imports

import pandas as pd
import numpy as np

## Load Data

In [52]:
# load data from P_WX2013.csv and display the first 3 rows

df = pd.read_csv('P_WX2013.csv', dtype={'DATE':object, 'PRCP':object,
'SNWD':object, 'SNOW':object,
'TMAX':object, 'TMIN':object,})

df[:3]


| | STATION | DATE | PRCP | SNWD | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|---|---|
| 0 | GHCND:USW00014764 | 20130101 | 0 | 254 | 0 | 0 | -117 |
| 1 | GHCND:USW00014764 | 20130102 | 0 | 254 | 0 | -44 | -161 |
| 2 | GHCND:USW00014764 | 20130103 | 0 | 229 | 0 | -50 | -178 |


# Summarized Data

In [53]:
# use .info() to get a sense of the data
# create a text cell below to summarize your findings

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   STATION  365 non-null    object
 1   DATE     364 non-null    object
 2   PRCP     364 non-null    object
 3   SNWD     365 non-null    object
 4   SNOW     365 non-null    object
 5   TMAX     364 non-null    object
 6   TMIN     364 non-null    object
dtypes: object(7)
memory usage: 20.1+ KB


There are 7 columns, and all of them have the object data type. There are 365 entries, so 365 rows. STATION, SNWD, and SNOW are non-null for every entry. Each of the remaining columns (DATE, PRCP, TMAX, and TMIN) has 1 null value out of 365.
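The null counts read off `.info()` above can also be computed directly. The sketch below uses a small made-up frame (the name `sample` and its values are hypothetical, not taken from P_WX2013.csv) to show one way of counting missing values per column and pulling out the affected rows.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real data
sample = pd.DataFrame(
    {
        "DATE": ["20130101", None, "20130103"],
        "PRCP": ["0", "5", None],
        "SNOW": ["0", "0", "0"],
    },
    dtype=object,
)

# Number of missing values in each column
null_counts = sample.isna().sum()
print(null_counts)

# Rows that contain at least one missing value
rows_with_nulls = sample[sample.isna().any(axis=1)]
print(rows_with_nulls)
```

Running the same two expressions on `df` would show which rows hold the missing DATE, PRCP, TMAX, and TMIN values.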

## Summary of Basic Descriptive Statistics
```
df.describe()
```

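Returns a DataFrame of basic descriptive statistics, with the statistics as the index. For numeric columns these are count, mean, std, min, the 25%, 50%, and 75% percentiles, and max. For object columns, as in this dataset, they are count, unique, top, and freq.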

In [54]:
# display the descriptive statistics for the df
# create a text cell below to summarize your findings

df.describe()

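| | STATION | DATE | PRCP | SNWD | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|---|---|
| count | 365 | 364 | 364 | 365 | 365 | 364 | 364 |
| unique | 1 | 363 | 63 | 25 | 20 | 87 | 92 |
| top | GHCND:USW00014764 | 20130321 | 0 | 0 | 0 | 189 | 0 |
| freq | 365 | 2 | 234 | 268 | 336 | 12 | 16 |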


STATION has a single value, "GHCND:USW00014764", which appears in all 365 entries. Almost all of the DATE column's entries are unique, except for one date that repeats, "20130321". PRCP has quite a few unique entries, but 0 is the most frequent, occurring 234 times. SNWD's most frequent entry is also 0, appearing 268 times, and the column has a relatively low number of unique entries. SNOW, like the previous two columns, has 0 as its most frequent entry, appearing a staggering 336 times, with only 19 other unique entries. TMAX has 87 unique entries, the most common being 189, which appears 12 times. TMIN has a similar number of unique entries. Its top entry, like most of the columns, is 0, appearing 16 times. DATE, PRCP, TMAX, and TMIN each have 364 non-null entries rather than 365.
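Because every column was loaded as object, `describe()` reports count, unique, top, and freq rather than means and quartiles. As a rough sketch, assuming the values are numeric strings, converting a column with `pd.to_numeric` first gives the numeric summary instead. The frame `sample` below is a hypothetical stand-in, not the real data.

```python
import pandas as pd

# Hypothetical sample: temperatures stored as strings, as in the loaded df
sample = pd.DataFrame({"TMAX": ["0", "-44", "-50", "189", "189"]}, dtype=object)

# Object column: the summary has count, unique, top, freq
object_summary = sample.describe()
print(object_summary)

# Numeric column: the summary has count, mean, std, min, quartiles, max
tmax_numeric = pd.to_numeric(sample["TMAX"])
numeric_summary = tmax_numeric.describe()
print(numeric_summary)
```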

## Unique Values
```
df['col_name'].unique()
```

Returns a Numpy array containing the unique values in the specified column.



In [55]:
# look at the unique values in a column you choose
# create a text cell below to summarize your findings
df['SNOW'].unique()


array(['0', '3', '130', '48', '5', '389', '422', '51', '272', '74', '38',
       '13', '221', '23', '18', '46', '269', '229', '33', '-9999'],
      dtype=object)

SNOW has 20 unique entries, all stored as strings (dtype object). One of them, -9999, looks out of place.
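One possible way to deal with the odd -9999 entry, assuming it is a placeholder for a missing measurement (the notebook does not confirm this), is to convert the column to numbers and replace that value with NaN. The Series `snow` below is a hypothetical sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of SNOW values stored as strings
snow = pd.Series(["0", "3", "130", "-9999"], dtype=object, name="SNOW")

# Convert the strings to numbers, then treat -9999 as missing
# (assumption: -9999 marks a missing value rather than a real reading)
snow_numeric = pd.to_numeric(snow).replace(-9999, np.nan)

print(snow_numeric)
print(snow_numeric.max())
```

After the replacement, statistics such as `max()` or `mean()` skip the missing entry instead of being distorted by it.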

```
df['col_name'].nunique()
```
Returns the number of unique values in the column.

```
df.nunique()
```
Returns the number of unique values in each column of the dataframe.


In [56]:
# how many unique values are there in that column

df['SNOW'].nunique()


# how many unique values are there in the whole dataframe

df.nunique()

STATION      1
DATE       363
PRCP        63
SNWD        25
SNOW        20
TMAX        87
TMIN        92
dtype: int64

## Value Counts

`df[col_name].value_counts()`

Returns a Series with the unique values as indices and the value counts as the values.
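A minimal sketch of what the returned Series looks like, using a hypothetical `tmax` sample rather than the real column. By default the result is sorted with the most frequent value first and missing values are left out; passing `dropna=False` includes them.

```python
import pandas as pd

# Hypothetical sample of TMAX values, including one missing entry
tmax = pd.Series(["189", "189", "0", None, "189", "0", "-44"], dtype=object, name="TMAX")

# Most frequent value first; missing values are excluded by default
counts = tmax.value_counts()
print(counts)

# The first index label is the most common value, the first value is its count
most_common = counts.index[0]
most_common_count = counts.iloc[0]

# Include missing values in the tally
counts_with_nan = tmax.value_counts(dropna=False)
print(counts_with_nan)
```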

In [57]:
# run this function on the TMAX column; create a text cell below and describe
# how many unique temps are in that column and what the first entry in the
# series means

# df['TMAX'].value_counts()

# run this function on the entire dataframe and summarize what information you
# can glean from the output

df.value_counts()


STATION            DATE      PRCP  SNWD   SNOW   TMAX  TMIN
GHCND:USW00014764  20130321  0     178    0      33    -61     2
                   20130830  0     0      0      267   161     1
                   20130909  0     0      0      206   44      1
                   20130908  3     0      0      206   72      1
                   20130907  0     0      0      261   89      1
                                                              ..
                   20130501  0     0      0      172   0       1
                   20130430  0     0      0      178   22      1
                   20130429  0     0      0      156   28      1
                   20130428  0     0      0      161   11      1
                   20131231  3     -9999  -9999  -105  -171    1
Name: count, Length: 360, dtype: int64

1. There are 87 unique temps in the TMAX column, which is the length of the Series returned by `df['TMAX'].value_counts()`. The first entry is the most common temperature, 189, which occurs on 12 days of the year.

2. On December 31 (the last row of the output), SNWD and SNOW both contain -9999, a value that makes no sense for a snow measurement and is probably wrong. Running the function on the whole dataset also shows that one day's record is repeated. Every day should be unique, which is why every count is 1 except the first entry in the series, which is 2. When using this dataset we will need to handle that duplicate.
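The repeated record can also be located and removed directly. This is a sketch on a hypothetical four-row frame, not the author's cleanup step. As a side note, `DataFrame.value_counts()` drops rows containing missing values by default, which would be consistent with the output length of 360 rather than 365, although the notebook does not verify this.

```python
import pandas as pd

# Hypothetical sample with one fully duplicated row
sample = pd.DataFrame(
    {
        "DATE": ["20130320", "20130321", "20130321", "20130322"],
        "TMAX": ["50", "33", "33", "61"],
    },
    dtype=object,
)

# Show every row that has an exact duplicate elsewhere in the frame
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# Keep the first copy of each duplicated row and renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)
print(deduped)
```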

# DateTime Library



## Converting Dates to Datetime Objects

[pd.to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)

```
df[col_name] = pd.to_datetime(df[col_name])
```

This function is used to convert various data types (strings, integers, floats, lists of date components) into datetime objects. It can parse a wide range of formats and offers options for handling errors or specifying formats.
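As an illustration of the format and error-handling options mentioned above, the sketch below parses `YYYYMMDD` strings with an explicit `format` and uses `errors="coerce"` so that unparseable entries become `NaT` instead of raising. The Series `raw_dates` is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample: two valid dates, one missing value, one bad string
raw_dates = pd.Series(["20130101", "20130102", None, "not a date"], dtype=object)

# An explicit format avoids guessing; errors="coerce" turns bad entries into NaT
parsed = pd.to_datetime(raw_dates, format="%Y%m%d", errors="coerce")
print(parsed)
```

Stating the format explicitly makes the intent clear when dates are stored as compact digit strings such as `20130101`.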

In [58]:
# convert the DATE column to datetime objects and run .info()
# on the dataframe to confirm that the change was made
df['DATE'] = pd.to_datetime(df['DATE'])
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   STATION  365 non-null    object        
 1   DATE     364 non-null    datetime64[ns]
 2   PRCP     364 non-null    object        
 3   SNWD     365 non-null    object        
 4   SNOW     365 non-null    object        
 5   TMAX     364 non-null    object        
 6   TMIN     364 non-null    object        
dtypes: datetime64[ns](1), object(6)
memory usage: 20.1+ KB


We know it worked because the data type for DATE is now datetime64[ns].

## Accessing Periods of Time

To access the specific periods of a datetime object:

`df[col_name].dt.year` returns the year of the date time.

`df[col_name].dt.month` returns the month of the date time.

`df[col_name].dt.day` returns the day of the date time.

`df[col_name].dt.hour` returns the hour of the date time.

`df[col_name].dt.minute` returns the minute of the date time.

In [59]:
# try accessing each part of the dates in the DATE column

df_year = df['DATE'].dt.year
df_month = df['DATE'].dt.month
df_day = df['DATE'].dt.day
df_hour = df['DATE'].dt.hour
df_minute = df['DATE'].dt.minute


In [60]:
df_year

0      2013.0
1      2013.0
2      2013.0
3      2013.0
4      2013.0
        ...  
360    2013.0
361    2013.0
362    2013.0
363    2013.0
364    2013.0
Name: DATE, Length: 365, dtype: float64

In [61]:
df_month

0       1.0
1       1.0
2       1.0
3       1.0
4       1.0
       ... 
360    12.0
361    12.0
362    12.0
363    12.0
364    12.0
Name: DATE, Length: 365, dtype: float64

In [62]:
df_day

0       1.0
1       2.0
2       3.0
3       4.0
4       5.0
       ... 
360    27.0
361    28.0
362    29.0
363    30.0
364    31.0
Name: DATE, Length: 365, dtype: float64

In [63]:
df_hour

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
360    0.0
361    0.0
362    0.0
363    0.0
364    0.0
Name: DATE, Length: 365, dtype: float64

In [64]:
df_minute

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
360    0.0
361    0.0
362    0.0
363    0.0
364    0.0
Name: DATE, Length: 365, dtype: float64
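The outputs above are float64 (for example `2013.0`) rather than integers. A likely reason is the one missing DATE value: it becomes `NaT` after conversion, and the `.dt` accessors return NaN for it, which forces a float result. The sketch below, on a hypothetical `dates` sample, shows one way to get whole numbers back with the nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical sample with one missing date
dates = pd.to_datetime(
    pd.Series(["20130101", None, "20131231"], dtype=object), format="%Y%m%d"
)

# The missing date yields NaN, so the years come back as floats
years_float = dates.dt.year
print(years_float)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
years_int = dates.dt.year.astype("Int64")
print(years_int)
```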

In [65]:
# get all rows from April

april = df['DATE'].dt.month == 4

df_april = df.loc[april]
df_april

| | STATION | DATE | PRCP | SNWD | SNOW | TMAX | TMIN |
|---|---|---|---|---|---|---|---|
| 90 | GHCND:USW00014764 | 2013-04-01 | 160 | 0 | 0 | 128 | 0 |
| 91 | GHCND:USW00014764 | 2013-04-02 | 0 | 0 | 0 | 39 | -21 |
| 92 | GHCND:USW00014764 | 2013-04-03 | 0 | 0 | 0 | 72 | -16 |
| 93 | GHCND:USW00014764 | 2013-04-04 | 0 | 0 | 0 | 117 | -32 |
| 94 | GHCND:USW00014764 | 2013-04-05 | 0 | 0 | 0 | 122 | 17 |
| 95 | GHCND:USW00014764 | 2013-04-06 | 0 | 0 | 0 | 56 | -32 |
| 96 | GHCND:USW00014764 | 2013-04-07 | 0 | 0 | 0 | 78 | -38 |
| 97 | GHCND:USW00014764 | 2013-04-08 | 0 | 0 | 0 | 161 | 17 |
| 98 | GHCND:USW00014764 | 2013-04-09 | 46 | 0 | 0 | 111 | 56 |
| 99 | GHCND:USW00014764 | 2013-04-10 | 119 | 0 | 0 | 128 | 56 |


In [66]:
# get the TMAX values from the April rows

df_april['TMAX']

90     128
91      39
92      72
93     117
94     122
95      56
96      78
97     161
98     111
99     128
100    117
101     44
102     89
103    100
104    106
105    139
106    172
107     94
108    183
109    139
110    111
111     83
112     56
113    178
114    178
115    117
116    139
117    161
118    156
119    178
Name: TMAX, dtype: object
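The April TMAX values are still strings (dtype object), so they cannot be averaged or compared numerically as they stand. A minimal sketch of filtering by month and converting before computing a statistic, using a hypothetical four-row `sample` frame:

```python
import pandas as pd

# Hypothetical sample spanning March to May, with TMAX stored as strings
sample = pd.DataFrame(
    {
        "DATE": pd.to_datetime(["2013-03-31", "2013-04-01", "2013-04-02", "2013-05-01"]),
        "TMAX": pd.Series(["100", "128", "39", "172"], dtype=object),
    }
)

# Boolean mask selecting the April rows
is_april = sample["DATE"].dt.month == 4

# Convert the selected strings to numbers before doing arithmetic
april_tmax = pd.to_numeric(sample.loc[is_april, "TMAX"])
print(april_tmax.mean())
```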