# **Data Collection Notebook**

## Objectives

*   Fetch data from Kaggle
*   Fetch geospatial data
*   Combine both and save.

## Inputs

*  Geospatial data **(ideally, this dataset would be pulled with SQL or through an API)**

## Outputs

* Dataset outputs/datasets/collection/WeatherAustralia.csv

## Additional Comments | Insights | Conclusions


* Geospatial data is provided in a proper format for you.
  * In its raw format, it had to be engineered separately, since it didn't have all the cities from the Kaggle dataset (WeatherAUS) mapped. This task was manual and is already done, so you don't have to worry about it.



---

# Install and Import packages

* You may need to restart the runtime after installing packages. Please check the cell output when installing a package.

In [None]:
# this notebook doesn't need to install/update packages


# Code for restarting the runtime; it will restart the Colab session
# It is good practice to do so after you install a package in a Colab session
import os
os.kill(os.getpid(), 9)

* If you want to see which packages the session provides

In [None]:
!pip freeze

# Setup GPU

* The notebook is already set up to use a GPU; however, it is worth recalling the process:

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: when you select an option (GPU, TPU or None), you switch among kernels/sessions

---
* How do I know if I am using the GPU?
  * Run the code below. If the output is a device name such as '/device:GPU:0', you are using the GPU in this session; an empty string means no GPU is available.


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Password: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* **Credentials format disclaimer**: when opening Jupyter notebooks in Colab that are hosted on GitHub, please avoid special characters in your **password**, such as ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { } | ~
  * Otherwise the git push command will not work properly, because the credentials are concatenated into the command as username:password@github.com/username/repo, and special characters in these terms break it.
  * Note that GitHub no longer accepts account passwords for Git operations over HTTPS; a personal access token is entered in place of the password.
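
* One possible workaround, not part of the original notebook, is to percent-encode the credentials before they are placed in the remote URL. A minimal sketch using the standard library; the helper name `build_remote_url` and the sample user, password and repository are made up for illustration:

```python
from urllib.parse import quote

def build_remote_url(user, password, repo):
    """Build an HTTPS remote URL with percent-encoded credentials."""
    # safe="" forces every reserved character (including '/' and '@') to be encoded
    user_enc = quote(user, safe="")
    pwd_enc = quote(password, safe="")
    return f"https://{user_enc}:{pwd_enc}@github.com/{user}/{repo}.git"

# hypothetical values, for illustration only
url = build_remote_url("octo-user", "p@ss:w/rd", "my-repo")
print(url)
```

* The encoded URL could then be passed to `git remote add origin` in place of the raw concatenation.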

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove the content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")


---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [None]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

import uuid
file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
with open(f"{file_name}.txt", "w") as file: file.write("text")
print("=== Testing Session Connectivity to the Repo === \n")
! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
print("\n\n")
os.remove(f"{file_name}.txt")
! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (user email and password)
os.environ['UserPwd'] = os.environ['UserEmail'] = ""


---

### **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "update"
! git add .
! git commit -m {CommitMsg}
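
* Note that `{CommitMsg}` is pasted into the shell command as-is, so a message containing spaces would be split into several arguments. A minimal sketch, assuming you want multi-word commit messages, that quotes the message with the standard library first (the `SafeCommitMsg` name is illustrative):

```python
import shlex

CommitMsg = "update data collection notebook"
# shlex.quote wraps the text so the shell treats it as a single argument
SafeCommitMsg = shlex.quote(CommitMsg)
print(f"git commit -m {SafeCommitMsg}")
```

* In a Colab cell, the command would then be `! git commit -m {SafeCommitMsg}`.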

* Git Push

In [None]:
!git push origin main

---

### **Delete** Cloned Repo from current Session

In [None]:
%cd /content
!rm -rf {os.environ['RepoName']}
print(f"\n * Please refresh session folder to validate that {os.environ['RepoName']} folder was removed from this session.")
print(f"\n\n* Current session directory is:  {os.getcwd()}")

---

# Fetch data from Kaggle

* Make sure the kaggle package is installed. In a Colab session, it normally is. If it is not, run the following command in a code cell: **! pip install -q kaggle**

In [None]:
pip show kaggle

---

* You first need to download a **json file (authentication token)** from Kaggle to your machine. 
* The process is:
  1. From the site header, click on your user profile picture, then on “My Account” in the dropdown menu. This will take you to your account settings. Scroll down to the section of the page labelled API.
  2. Click Expire API Token to remove previous tokens.
  3. To create a new token, click on the “Create New API Token” button. It will generate a fresh authentication token and download the kaggle.json file to your machine.
  

* If you run into any difficulty, go to the "Authentication" section in this [link](https://www.kaggle.com/docs/api).



* In the end, you should have this file saved locally on your machine. **Please make sure this file is named kaggle.json**


* Upload your kaggle.json file to this Colab session
* Once you run the cell below, click on "Choose Files", find your kaggle.json file and select it

In [None]:
from google.colab import files
files.upload()

import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
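
* Before calling the Kaggle CLI, it can help to confirm that the uploaded token looks right. A minimal sketch, assuming the token file holds the `username` and `key` fields and should have 600 permissions as set above; the helper name `check_kaggle_token` is illustrative and not part of the kaggle package:

```python
import json
import os
import stat

def check_kaggle_token(path="kaggle.json"):
    """Return True if the token file has the expected fields and 600 permissions."""
    with open(path) as f:
        token = json.load(f)
    has_fields = {"username", "key"} <= set(token)
    # keep only the permission bits (0o600 = read/write for the owner only)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return has_fields and mode == 0o600
```

* Calling `check_kaggle_token()` from the folder that holds kaggle.json should return `True` once the cell above has run.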

* Get the dataset path from the Kaggle URL. When you are viewing the dataset at Kaggle, check what comes after https://www.kaggle.com/ (and after datasets/, if that segment is present in the URL). Copy that into KaggleDatasetPath.
* Set your destination folder.

In [None]:
KaggleDatasetPath = "jsphyg/weather-dataset-rattle-package"
DestinationFolder = "inputs/datasets/raw"   
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* Unzip the downloaded file, delete the zip file and delete kaggle.json file

In [None]:
!unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
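
* If you prefer to stay in Python rather than use shell commands, the unzip-and-delete step can be sketched with the standard library. This is an alternative, assuming every `.zip` file in the folder should be extracted; the helper name `unzip_and_remove` is illustrative:

```python
import zipfile
from pathlib import Path

def unzip_and_remove(folder):
    """Extract every .zip file in `folder` into that folder, then delete the archive."""
    extracted = []
    # sorted(...) materialises the list before any file is deleted
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # delete the archive once its content is extracted
    return extracted
```

* For this notebook, the call would be `unzip_and_remove(DestinationFolder)`; the kaggle.json file would still need to be deleted separately.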

* Well done! You can now push the changes to your GitHub Repo, using the Git commands (git add, git commit, git push)
* The code for that is in the section **"Connection between: Colab Session and your GitHub Repo"**

---

# Load Kaggle data

In [None]:
import pandas as pd
df = pd.read_csv("/content/WalkthroughProject/inputs/datasets/raw/weatherAUS.csv")
df.info()

* Renaming 'Rainfall' to 'RainfallToday'

In [None]:
df.rename(mapper={'Rainfall':'RainfallToday'},axis=1,inplace=True)
df.columns

* Set Date as datetime format

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

* Are all dates in proper sequence (with no missing dates) for a given city?

In [None]:
given_city = 'Albury'
df_albury = df.query(f"Location == '{given_city}'").copy()

df_no_missing_date = pd.DataFrame(
    data={"Date":pd.date_range(start = df_albury['Date'].min(), end = df_albury['Date'].max() )})

print(f"* df_albury shape: {df_albury.shape} \n"
      f"* df_no_missing_date shape:{df_no_missing_date.shape} \n"
      f"* It means that {len(df_no_missing_date) - len(df_albury)} dates are missing for {given_city}. \n"
      f"* We should add these missing dates before adding 'RainfallTomorrow'")

* Does this happen only for Albury, or for other cities as well?

In [None]:
rows = []  # one single-row dataframe per city

for city in df['Location'].unique():  
  dfCity = df.query(f"Location == '{city}'").copy()
  
  dfAux = pd.DataFrame(
      data={"RowsFullRange":len(pd.date_range(start = dfCity['Date'].min(), end = dfCity['Date'].max())),
            "RowsOriginal": len(dfCity)
      },
      index=[city])
  dfAux['Difference'] = dfAux["RowsFullRange"] - dfAux["RowsOriginal"]
  rows.append(dfAux)

# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat
df_analysis = pd.concat(rows)
  
df_analysis[['Difference']].hist(bins=50,figsize=(10,4));

* Only 3 cities don't have missing dates!

In [None]:
df_analysis.sort_values(by='Difference').head()
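
* To see which dates are missing, rather than only how many, the full calendar range can be compared with the observed dates. A minimal sketch on made-up toy dates (not taken from the dataset), using `DatetimeIndex.difference`:

```python
import pandas as pd

# toy observation dates with two days missing (3rd and 4th of January)
observed = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])
full_range = pd.date_range(start=observed.min(), end=observed.max())

# dates in the full calendar range that were never observed
missing = full_range.difference(observed)
print(missing)
```

* For a real city, `observed` would be that city's `Date` column, for example `df_albury['Date']`.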

* Add 'RainfallTomorrow' and 'RainYesterday'

In [None]:
def AddRainfallTomorrowAndRainYesterday(df):
  city_frames = []  # one dataframe per city, combined at the end
  
  for city in df['Location'].unique():  # loops over all cities

    # subset data from the given city
    dfCity = df.query(f"Location == '{city}'").copy()
    
    # create a dataframe with no missing dates. It will have one column only
    df_city_all_dates = pd.DataFrame(
        data={"Date":pd.date_range(start = dfCity['Date'].min(), end = dfCity['Date'].max() )})
    

    # combine both (it will create many missing values, but there will be no missing dates)
    df_city_all_dates = df_city_all_dates.merge(right=dfCity, how='left', on='Date', sort=True)


    # Create RainfallTomorrow; RainYesterday is currently disabled (commented out)
    df_city_all_dates['RainfallTomorrow'] = df_city_all_dates['RainfallToday'].shift(-1)
    # df_city_all_dates['RainYesterday'] = df_city_all_dates['RainToday'].shift(1)


    # remove days where there is no data collection for the given city
    df_city_all_dates.dropna(subset=['Location'],inplace=True)

    # collect this city's dataframe
    city_frames.append(df_city_all_dates)
  
  # DataFrame.append was removed in pandas 2.0, so pd.concat combines the pieces
  df_final = pd.concat(city_frames)
  df_final.reset_index(drop=True, inplace=True)
  return df_final

df = AddRainfallTomorrowAndRainYesterday(df)
print(df.shape)
df.head(3)
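
* Why fill in the missing dates before shifting? A minimal sketch on made-up toy data shows the difference. It uses `asfreq("D")` as a compact stand-in for the merge with `pd.date_range` used in the function above:

```python
import pandas as pd

# toy data: 3 January is missing
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"]),
    "RainfallToday": [1.0, 2.0, 4.0],
})

# naive shift: 2 January gets paired with 4 January, which is not "tomorrow"
naive = toy["RainfallToday"].shift(-1)

# fill the calendar first, so the gap becomes an explicit row of missing values
full = toy.set_index("Date").asfreq("D")
full["RainfallTomorrow"] = full["RainfallToday"].shift(-1)

# keep only the originally observed dates
aligned = full.loc[toy["Date"], "RainfallTomorrow"]
print(aligned)
```

* With the calendar filled in, 2 January correctly ends up with a missing 'RainfallTomorrow', since there is no record for 3 January.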

# Load spatial data

* The raw data, mapping Australian cities to GPS coordinates and state, was downloaded from: https://simplemaps.com/data/au-cities
* However, this dataset didn't have all the locations present in weatherAUS.csv (like Uluru, PerthAirport, MelbourneAirport etc). We added this information for you, so you don't have to worry about it for this project.

In [None]:
spatial_data_link = (f"https://raw.githubusercontent.com/{os.environ['UserName']}/{os.environ['RepoName']}/"
                     f"main/inputs/datasets/raw/GeospatialAustralia.csv")

df_spatial = pd.read_csv(spatial_data_link)
df_spatial

* Let's filter the most relevant variables for this project and rename them

In [None]:
df_spatial = (df_spatial
              .filter(['city', 'lat', 'lng', 'admin_name'])
              .rename(mapper={"city":"Location",
                              "lat":"Latitude",
                              "lng":"Longitude",
                              "admin_name":"State"},axis=1)
              )

df_spatial

* Does the spatial dataset cover all cities from the WeatherAUS dataset?

In [None]:
# cities available in the spatial dataset, stored as a set for fast lookup
spatial_cities = set(df_spatial['Location'].unique())

count = 0
for city_df in sorted(df['Location'].unique()):
  if city_df not in spatial_cities:
    count +=1
    print(f"{city_df}")

print(f"\n\n* There are {count} cities that are not mapped \n\n")

# Combining both datasets

In [None]:
df_combination = df.merge(right=df_spatial, how='left',on='Location')
df_combination.head(3)

* Evaluating datasets shape

In [None]:
print(f"* df_combination.shape {df_combination.shape} \n"
      f"* df.shape {df.shape} \n"
      f"* df_spatial.shape {df_spatial.shape}")

* Check if there is missing data in the merged columns

In [None]:
df_combination.filter(['Location','Latitude', 'Longitude','State']).isna().sum()
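
* The same check can be built into the merge itself. A minimal sketch on made-up toy data, using the `validate` and `indicator` parameters of `DataFrame.merge`; the `weather` and `spatial` frames below are illustrative stand-ins for `df` and `df_spatial`:

```python
import pandas as pd

# toy stand-ins for df and df_spatial
weather = pd.DataFrame({"Location": ["Albury", "Albury", "Uluru"],
                        "RainfallToday": [0.6, 0.0, 1.2]})
spatial = pd.DataFrame({"Location": ["Albury"], "State": ["New South Wales"]})

# validate raises an error if a Location appears more than once in `spatial`;
# indicator adds a `_merge` column telling where each row found a match
merged = weather.merge(spatial, how="left", on="Location",
                       validate="many_to_one", indicator=True)

# locations from `weather` with no match in `spatial`
unmatched = merged.loc[merged["_merge"] == "left_only", "Location"].unique()
print(unmatched)
```

* An empty `unmatched` array would mean every location received its coordinates and state.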

* Check how many unique cities are in the Kaggle dataset (df) and the geospatial dataset (df_spatial)

In [None]:
print(f"* There are {df['Location'].nunique()} unique cities at df dataset. \n"
      f"* There are {df_spatial['Location'].nunique()} unique cities at df_spatial dataset")

# Saving final dataset and pushing to Repo

In [None]:
import os
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

df_combination.to_csv("/content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv",index=False)
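
* The try/except above is there because `os.makedirs` fails when the folder already exists. An alternative sketch, assuming you would rather avoid the exception altogether, uses `pathlib` with `exist_ok=True`; the helper name `ensure_folder` is illustrative:

```python
from pathlib import Path

def ensure_folder(path):
    """Create the folder (and any missing parents) without failing if it already exists."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

* For this notebook, the call would be `ensure_folder('outputs/datasets/collection')`, and it can be run repeatedly without raising an error.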

* Well done! You can now push the changes to your GitHub Repo, using the Git commands (git add, git commit, git push)
  * The code for that is in the section "Connection between: Colab Session and your GitHub Repo"
* Then, save this notebook to your GitHub repo