
In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'volcanic-eruptions:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F705%2F1325%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240828%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240828T173012Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D02d0415dd8ae03cda017dfb2af859421d62f5b73eb2cfd6252362a847fc97354555906ed2d1c6332eb26892db1cf3b8e91bc6d4a10db1c8f42df556be94ce75f8a16a509a8d3f2bb767b885e39afebb8d82e619e19e3dfbef6d6259fc8fb4ccd5ade5b84fb0a6b99cceb494d9a46fb22ba88ce3504338dada3ebc7137d5245e6c752c27e68c3014294abf2f359b759fb2f98081c67dc5606ea81c9ac4494d6e1048ee09f071799abe216fb439a65b487c4231ff3e7c837cfa63cdd5cff03dbc56c0677f177dfb37e0a9b1563f375ebe46ebca94313dd553f0640ed52b22c00d9fdb3785564ba1f16ca524a3def60becbe1591a9b44999f641b4f3058f7bb239a,earthquake-database:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F732%2F1360%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240828%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240828T173012Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D84e196c3b7ebafa08feb1841e5cf0867cf45de4462dfeb95e2a401eaf32f9b370633a883d557fbc6458bb6e54b23844242cdc9912d5109bac1fb63ba2f0103b26890d60a4b80fdeb596cce2548859330d1deac2e0e5068571ca685601732da35df21e2dea31fd3e01b69cf604e516de9cc371d0a7bff7ade5ef5e52e2172640b964b9e822ef95bb639dead3c5e9dea28d714bc65d160dd07cb38224eea7117ee009a935f19cbf4bed5d1e82686647c432499ca2371bcb4e86f3b96e532f886e2c896d095944f6bc6759b03c121697ec75f1817f9fc5d9ce7064e1b35acf8a3067736fb5b6656eb6f5036cb33c81a9ed8322a011001742507d474c087a0a54a99'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
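            else:
              # Flush buffered bytes to disk before reopening the file by name,
              # and bind the archive to a new name so the tarfile module is not shadowed.
              tfile.flush()
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)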
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


**This notebook is an exercise in the [Data Cleaning](https://www.kaggle.com/learn/data-cleaning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/parsing-dates).**

---


In this exercise, you'll apply what you learned in the **Parsing dates** tutorial.

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [None]:
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex3 import *
print("Setup Complete")

# Get our environment set up

The first thing we'll need to do is load in the libraries and dataset we'll be using. We'll be working with a dataset containing information on earthquakes that occurred between 1965 and 2016.

In [None]:
# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import datetime

# read in our data
earthquakes = pd.read_csv("../input/earthquake-database/database.csv")

# set seed for reproducibility
np.random.seed(0)

In [None]:
earthquakes

# 1) Check the data type of our date column

You'll be working with the "Date" column from the `earthquakes` dataframe.  Investigate this column now: does it look like it contains dates?  What is the dtype of the column?

In [None]:
# TODO: Your code here!
earthquakes["Date"]
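
As a rough illustration of what to look for, the sketch below uses a small hypothetical Series (`sample_dates`, which is not part of the earthquake data) to show how the dtype changes once date-like strings are actually parsed. Before parsing, pandas usually reports `object` (newer versions may report a string dtype instead); after parsing, it reports a `datetime64` variant.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical stand-in for the "Date" column: strings that merely look like dates.
sample_dates = pd.Series(["01/02/1965", "01/04/1965", "01/05/1965"])

# Before parsing, pandas treats the values as plain text.
print(sample_dates.dtype)
looks_parsed_before = is_datetime64_any_dtype(sample_dates)

# After parsing with an explicit format, the dtype is a datetime64 variant.
sample_parsed = pd.to_datetime(sample_dates, format="%m/%d/%Y")
print(sample_parsed.dtype)
looks_parsed_after = is_datetime64_any_dtype(sample_parsed)
```

The same `is_datetime64_any_dtype` check can be applied to `earthquakes["Date"]` to confirm what the printed dtype suggests.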

Once you have answered the question above, run the code cell below to get credit for your work.

In [None]:
# Check your answer (Run this code cell to receive credit!)
q1.check()

In [None]:
# Line below will give you a hint
#q1.hint()

# 2) Convert our date columns to datetime

Most of the entries in the "Date" column follow the same format: "month/day/four-digit year".  However, the entry at index 3378 follows a completely different pattern.  Run the code cell below to see this.

In [None]:
earthquakes[3378:3383]

In [None]:
earthquakes.loc[2095]

This does appear to be an issue with data entry: ideally, all entries in the column have the same format.  We can get an idea of how widespread this issue is by checking the length of each entry in the "Date" column.

In [None]:
date_lengths = earthquakes.Date.str.len()
date_lengths.value_counts()
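
To see why string length is a useful first check, here is a minimal sketch on hypothetical toy values (`sample_dates` is made up for illustration). Entries in "month/day/four-digit year" form are 10 characters long, so any other length stands out in the counts.

```python
import pandas as pd

# Three dates in the expected format plus one ISO 8601 timestamp.
sample_dates = pd.Series(
    ["01/02/1965", "02/23/1975", "1975-02-23T02:58:41.000Z", "12/30/2016"]
)

# Length of each string, then how often each length occurs.
sample_lengths = sample_dates.str.len()
length_counts = sample_lengths.value_counts()
print(length_counts)
```

A length check only flags entries whose format differs in size; a malformed entry that happens to be 10 characters long would not be caught this way.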

It looks like two more rows have dates in a different format.  Run the code cell below to obtain the indices corresponding to those rows and print the data.

In [None]:
indices = np.where([date_lengths == 24])

In [None]:
indices = np.where([date_lengths == 24])[1]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]
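
The `[1]` in the cell above can look surprising. The sketch below, on a hypothetical `sample_lengths` Series, shows where it comes from: wrapping the boolean mask in a list makes it two-dimensional, so `np.where` returns a (rows, columns) pair and the positions of interest are in the second element. Passing the mask directly gives the same positions in element `[0]`.

```python
import numpy as np
import pandas as pd

# Hypothetical string lengths; positions 2 and 4 have the unusual length.
sample_lengths = pd.Series([10, 10, 24, 10, 24])
mask = sample_lengths == 24

# A list containing the mask has shape (1, 5), so np.where returns
# (row positions, column positions); the column positions are what we want.
wrapped = np.where([mask])
positions_from_wrapped = wrapped[1]

# Passing the mask itself keeps it 1-D, and the positions are in element [0].
positions_direct = np.where(mask)[0]

print(positions_from_wrapped, positions_direct)
```

Note that `np.where` returns integer positions rather than index labels. Using them with `.loc` works here only on the assumption that the dataframe has the default 0, 1, 2, ... index.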

Given all of this information, it's your turn to create a new column "date_parsed" in the `earthquakes` dataset that has correctly parsed dates in it.  

**Note**: When completing this problem, you are allowed to (but are not required to) amend the entries in the "Date" and "Time" columns.  Do not remove any rows from the dataset.

In [None]:
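# The flagged rows hold ISO 8601 timestamps. Rewrite them as "month/day/year"
# strings so the whole column shares one format (format="ISO8601" needs pandas 2.0+).
for i in indices:
    iso_date = pd.to_datetime(earthquakes.loc[i, 'Date'], format="ISO8601")
    earthquakes.loc[i, 'Date'] = iso_date.strftime("%m/%d/%Y")

# With a uniform column, one explicit format parses every row the same way.
earthquakes['date_parsed'] = pd.to_datetime(earthquakes['Date'], format="%m/%d/%Y")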


# Check your answer
q2.check()
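
A different route to the same kind of result, sketched here on hypothetical toy data rather than the real dataset, is to leave the original strings untouched and parse each format separately before recombining. This is only one possible approach, not the course's reference solution; the names `sample_dates`, `is_iso`, and `combined` are made up for the example, and it assumes the odd rows are 24-character ISO 8601 timestamps whose first 10 characters are the date.

```python
import pandas as pd

sample_dates = pd.Series(
    ["01/02/1965", "1975-02-23T02:58:41.000Z", "12/30/2016"]
)

# Flag the ISO 8601 timestamps by their length.
is_iso = sample_dates.str.len() == 24

# Parse the regular rows with the month/day/year format.
regular_part = pd.to_datetime(sample_dates[~is_iso], format="%m/%d/%Y")

# For the ISO rows, keep only the leading "YYYY-MM-DD" portion.
iso_part = pd.to_datetime(sample_dates[is_iso].str[:10], format="%Y-%m-%d")

# Recombine and restore the original row order.
combined = pd.concat([regular_part, iso_part]).sort_index()
print(combined)
```

Because neither branch modifies the source strings, the raw column stays available for later inspection.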

In [None]:
# Lines below will give you a hint or solution code
#q2.hint()
# q2.solution()

# 3) Select the day of the month

Create a Pandas Series `day_of_month_earthquakes` containing the day of the month from the "date_parsed" column.

In [None]:
# try to get the day of the month from the date column
day_of_month_earthquakes = earthquakes['date_parsed'].dt.day

# Check your answer
q3.check()
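
As a small illustration of the `.dt` accessor, the sketch below parses three hypothetical dates and pulls out both the day and the month. Comparing the two against the original strings is a quick way to confirm that month and day were not swapped during parsing.

```python
import pandas as pd

# Hypothetical dates in month/day/year order.
sample_parsed = pd.to_datetime(
    pd.Series(["01/02/1965", "02/23/1975", "12/30/2016"]),
    format="%m/%d/%Y",
)

# The .dt accessor exposes date components for every row at once.
sample_days = sample_parsed.dt.day
sample_months = sample_parsed.dt.month

print(sample_days.tolist(), sample_months.tolist())
```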

In [None]:
# Lines below will give you a hint or solution code
#q3.hint()
#q3.solution()

# 4) Plot the day of the month to check the date parsing

Plot the days of the month from your earthquake dataset.

In [None]:
day_of_month_earthquakes

In [None]:
sns.histplot(day_of_month_earthquakes, kde=True)

Does the graph make sense to you?
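
One way to build intuition for the expected shape, assuming events were spread evenly over the calendar, is to count the day-of-month values for every day of a single year. The sketch below uses a hypothetical series covering all of 2016 (a leap year); it is a reference pattern for comparison, not a statement about the earthquake data.

```python
import pandas as pd

# Hypothetical data: exactly one event on every day of 2016.
every_day = pd.Series(pd.date_range("2016-01-01", "2016-12-31", freq="D"))
days = every_day.dt.day

# Correctly parsed days must fall between 1 and 31.
all_valid = bool(days.between(1, 31).all())

# Count how many times each day of the month occurs.
per_day = days.value_counts().sort_index()
print(per_day.tail(3))
```

Days that do not exist in every month occur less often, so a dip at the right edge of the histogram is expected even when the parsing is correct.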

In [None]:
# Check your answer (Run this code cell to receive credit!)
q4.check()

In [None]:
# Line below will give you a hint
#q4.hint()

# (Optional) Bonus Challenge

For an extra challenge, you'll work with a [Smithsonian dataset](https://www.kaggle.com/smithsonian/volcanic-eruptions) that documents Earth's volcanoes and their eruptive history over the past 10,000 years.

Run the next code cell to load the data.

In [None]:
volcanos = pd.read_csv("../input/volcanic-eruptions/database.csv")
volcanos

Try parsing the column "Last Known Eruption" from the `volcanos` dataframe. This column contains a mixture of text ("Unknown") and years both before the common era (BCE, also known as BC) and in the common era (CE, also known as AD).

In [None]:
volcanos['Last Known Eruption'].sample(5)

In [None]:
unk_counts = volcanos['Last Known Eruption'].isin(['Unknown']).sum()
unk_counts

In [None]:
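def convert_year(year):
    """Convert strings like '1050 BCE' or '2010 CE' to a signed integer year.

    BCE years become negative; anything else (e.g. 'Unknown') becomes NaN.
    """
    year_split = year.split()
    if year_split[-1] == 'BCE':
        return -int(year_split[0])
    elif year_split[-1] == 'CE':
        return int(year_split[0])
    else:
        return np.nan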

volcanos['Last Known Eruption'] = volcanos['Last Known Eruption'].apply(convert_year)
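
For comparison, the same conversion can be sketched without a row-by-row `apply`, using pandas string methods. This is an alternative technique shown on hypothetical sample values, and it assumes every valid entry has the form "<digits> BCE" or "<digits> CE"; anything else becomes NaN.

```python
import pandas as pd

sample_eruptions = pd.Series(["1050 BCE", "2010 CE", "Unknown", "8300 BCE"])

# Capture the number and the era; rows that do not match yield NaN in both columns.
parts = sample_eruptions.str.extract(r"^(\d+)\s+(BCE|CE)$")

# Turn the captured digits into numbers and the era into a sign.
magnitude = pd.to_numeric(parts[0])
sign = parts[1].map({"BCE": -1, "CE": 1})

# BCE years come out negative, CE years positive, unmatched rows NaN.
signed_years = magnitude * sign
print(signed_years)
```

The anchored pattern (`^...$`) is deliberately strict, so unexpected variants are surfaced as NaN instead of being silently misread.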

In [None]:
volcanos.dropna(subset=['Last Known Eruption'], inplace=True)
volcanos.sort_values(by='Last Known Eruption', ascending=True, inplace=True)

In [None]:
sns.histplot(volcanos['Last Known Eruption'], kde=True)

# (Optional) More practice

If you're interested in graphing time series, [check out this tutorial](https://www.kaggle.com/residentmario/time-series-plotting-optional).

You can also look into passing columns that you know have dates in them to the `parse_dates` argument in `read_csv`. (The documentation [is here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).) Do note that this method can be very slow, but depending on your needs it may sometimes be handy to use.
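
A minimal sketch of `parse_dates`, using a small in-memory CSV in place of a real file (the column names and values here are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV content with a uniformly formatted date column.
csv_text = "Date,Magnitude\n01/02/1965,6.0\n12/30/2016,5.5\n"

# Ask read_csv to parse the "Date" column while loading.
parsed_on_read = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(parsed_on_read.dtypes)
```

This works best when the column is uniformly formatted. If pandas cannot parse a column as dates, it is returned unaltered as text, so it is worth checking the dtype after loading.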

# Keep going

In the next lesson, learn how to [**work with character encodings**](https://www.kaggle.com/alexisbcook/character-encodings).

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/data-cleaning/discussion) to chat with other learners.*