**This notebook is an exercise in the [Data Cleaning](https://www.kaggle.com/learn/data-cleaning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/parsing-dates).**

---


In this exercise, you'll apply what you learned in the **Parsing dates** tutorial.

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [67]:
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex3 import *
print("Setup Complete")

Setup Complete


# Get our environment set up

The first thing we'll need to do is load in the libraries and dataset we'll be using. We'll be working with a dataset containing information on earthquakes that occurred between 1965 and 2016.

In [68]:
# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import datetime

# read in our data
earthquakes = pd.read_csv("../input/earthquake-database/database.csv")

# set seed for reproducibility
np.random.seed(0)

# 1) Check the data type of our date column

You'll be working with the "Date" column from the `earthquakes` dataframe.  Investigate this column now: does it look like it contains dates?  What is the dtype of the column?

In [69]:
# TODO: Your code here!
earthquakes['Date'].dtype

dtype('O')

Once you have answered the question above, run the code cell below to get credit for your work.

In [70]:
# Check your answer (Run this code cell to receive credit!)
print(earthquakes['Date'].head())


0    01/02/1965
1    01/04/1965
2    01/05/1965
3    01/08/1965
4    01/09/1965
Name: Date, dtype: object


In [71]:
# Line below will give you a hint
q1.hint()


<span style="color:#3366cc">Hint:</span> Use `earthquakes['Date'].head()` to check that the column contains dates and verify that it has dtype "object".  You can also use `earthquakes['Date'].dtype` to verify the dtype.
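
As a minimal sketch of what the hint describes, the snippet below builds a small hypothetical Series (the `sample` values are illustrative, not read from the dataset) and compares its dtype before and after parsing. It assumes the "month/day/four-digit year" layout seen in the output above.

```python
import pandas as pd

# Hypothetical sample that mimics the first rows of the "Date" column
sample = pd.Series(["01/02/1965", "01/04/1965", "01/05/1965"], name="Date")

# Strings are stored as generic text, not as datetimes
is_datetime_before = pd.api.types.is_datetime64_any_dtype(sample)

# %m = month, %d = day, %Y = four-digit year
parsed = pd.to_datetime(sample, format="%m/%d/%Y")
is_datetime_after = pd.api.types.is_datetime64_any_dtype(parsed)

print(sample.dtype, "->", parsed.dtype)
print(is_datetime_before, is_datetime_after)
```

A column that only looks like dates will report a text dtype until it has been parsed explicitly.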

# 2) Convert our date columns to datetime

Most of the entries in the "Date" column follow the same format: "month/day/four-digit year".  However, the entry at index 3378 follows a completely different pattern.  Run the code cell below to see this.

In [None]:
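# %Y matches a four-digit year (%y expects only two digits).
# This call still raises an error while the malformed rows remain in "Date".
earthquakes['date_parsed'] = pd.to_datetime(earthquakes['Date'], format="%m/%d/%Y")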

In [None]:
earthquakes['lenDate'] = earthquakes['Date'].apply(len)
earthquakes.loc[earthquakes['lenDate'] > 10]

In [None]:
# Replace the three ISO-style timestamps with values in each column's usual format
earthquakes.loc[3378, "Date"] = "02/23/1975"
earthquakes.loc[7512, "Date"] = "04/28/1985"
earthquakes.loc[20650, "Date"] = "03/13/2011"
earthquakes.loc[3378, "Time"] = "02:58:41"
earthquakes.loc[7512, "Time"] = "02:53:41"
earthquakes.loc[20650, "Time"] = "02:23:34"
earthquakes.loc[[3378, 7512, 20650]]

In [None]:
earthquakes['date_parsed'] = pd.to_datetime(earthquakes['Date'], format="%m/%d/%Y")

In [None]:
earthquakes['date_parsed'].head()

This does appear to be an issue with data entry: ideally, all entries in the column have the same format.  We can get an idea of how widespread this issue is by checking the length of each entry in the "Date" column.

In [73]:
date_lengths = earthquakes.Date.str.len()
date_lengths.value_counts()

10    23409
24        3
Name: Date, dtype: int64

Looks like there are two more rows with dates in a different format.  Run the code cell below to obtain the indices corresponding to those rows and print the data.

In [74]:
indices = np.where(date_lengths == 24)[0]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]

Indices with corrupted data: [ 3378  7512 20650]


Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
3378,1975-02-23T02:58:41.000Z,1975-02-23T02:58:41.000Z,8.017,124.075,Earthquake,623.0,,,5.6,MB,...,,,,,,USP0000A09,US,US,US,Reviewed
7512,1985-04-28T02:53:41.530Z,1985-04-28T02:53:41.530Z,-32.998,-71.766,Earthquake,33.0,,,5.6,MW,...,,,,,1.3,USP0002E81,US,US,HRV,Reviewed
20650,2011-03-13T02:23:34.520Z,2011-03-13T02:23:34.520Z,36.344,142.344,Earthquake,10.1,13.9,289.0,5.8,MWC,...,,32.3,,,1.06,USP000HWQP,US,US,GCMT,Reviewed


Given all of this information, it's your turn to create a new column "date_parsed" in the `earthquakes` dataset that has correctly parsed dates in it.  

**Note**: When completing this problem, you are allowed to (but are not required to) amend the entries in the "Date" and "Time" columns.  Do not remove any rows from the dataset.
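
One possible approach, sketched here on a small hypothetical Series rather than the real dataset, is to leave the "Date" column untouched and parse each format separately. The names `dates`, `is_iso`, `short`, and `long_form` are illustrative, and the sketch assumes that every 24-character entry begins with a `YYYY-MM-DD` date, as in the three rows shown above.

```python
import pandas as pd

# Hypothetical mix of the two formats seen in the "Date" column
dates = pd.Series(
    ["01/02/1965", "1975-02-23T02:58:41.000Z", "03/13/2011"],
    name="Date",
)

# Entries with 24 characters are assumed to be ISO-style timestamps
is_iso = dates.str.len() == 24

# Parse the regular entries; the ISO-style ones are masked out and become NaT
short = pd.to_datetime(dates.where(~is_iso), format="%m/%d/%Y")

# Keep only the leading YYYY-MM-DD part of the ISO-style entries and parse it
long_form = pd.to_datetime(dates.where(is_iso).str[:10], format="%Y-%m-%d")

# Combine the two partial results into one fully parsed Series
parsed = short.fillna(long_form)
print(parsed)
```

Editing the three entries by hand and then parsing the whole column with `format="%m/%d/%Y"` is an equally valid route; the masked version simply avoids hard-coding row indices.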

In [None]:
earthquakes['date_parsed'] = pd.to_datetime(earthquakes['Date'], format="%m/%d/%Y")

In [None]:
# Lines below will give you a hint or solution code
q2.check()
q2.hint()
q2.solution()

In [None]:
earthquakes['lenDate'] = earthquakes['Date'].apply(len)
earthquakes.loc[earthquakes['lenDate'] > 10]

# 3) Select the day of the month

Create a Pandas Series `day_of_month_earthquakes` containing the day of the month from the "date_parsed" column.

In [None]:
# try to get the day of the month from the date column
day_of_month_earthquakes = earthquakes['date_parsed'].dt.day


# Check your answer
q3.check()

In [None]:
# Lines below will give you a hint or solution code
q3.check()
q3.hint()
q3.solution()
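
For illustration, here is a small sketch of the `.dt` accessor on a hypothetical parsed Series (the values are made up for the example). It also shows that a missing date carries through as a missing day, which is why missing values may need to be dropped before plotting.

```python
import pandas as pd

# Hypothetical parsed dates, including one missing entry
parsed = pd.to_datetime(
    pd.Series(["01/02/1965", "02/23/1975", None]),
    format="%m/%d/%Y",
)

# .dt.day extracts the day of the month; NaT becomes NaN
days = parsed.dt.day
print(days)

# Drop the missing value before counting or plotting
clean_days = days.dropna().astype(int)
print(clean_days.tolist())
```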

# 4) Plot the day of the month to check the date parsing

Plot the days of the month from your earthquake dataset.

In [None]:
# TODO: Your code here!
day_of_month_earthquakes = day_of_month_earthquakes.dropna()
sns.distplot(day_of_month_earthquakes, kde=False, bins=31)

Does the graph make sense to you?

In [76]:
# Check your answer (Run this code cell to receive credit!)
q4.check()


<span style="color:#33cc33">Correct:</span> 

The graph should make sense: it shows a relatively even distribution in days of the month, which is what we would expect.

In [77]:
# Line below will give you a hint
q4.hint()


<span style="color:#3366cc">Hint:</span> 
Remove the missing values, and then use `sns.distplot()` as follows:

```python
# remove na's
day_of_month_earthquakes = day_of_month_earthquakes.dropna()

# plot the day of the month
sns.distplot(day_of_month_earthquakes, kde=False, bins=31)
```
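
The same sanity check can be sketched numerically, without a plot. The example below uses a hypothetical Series with one date per day of 1965 instead of the earthquake data and counts how often each day of the month appears. With correctly parsed dates, the values should span 1 to 31, with slightly fewer occurrences of days 29 to 31.

```python
import pandas as pd

# Hypothetical data: one date for every day of 1965 (not a leap year)
dates = pd.Series(pd.date_range("1965-01-01", "1965-12-31", freq="D"))

# Count occurrences of each day of the month, ordered by day
day_counts = dates.dt.day.value_counts().sort_index()

print(day_counts.index.min(), day_counts.index.max())
print(day_counts.loc[[1, 29, 31]])
```

If nearly all values fell between 1 and 12, that would suggest the day and month had been swapped during parsing.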

