<a href="https://colab.research.google.com/github/FernandoRocha88/WalkthroughProject/blob/main/jupyter_notebooks/01%20-%20DataCollection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Collection Notebook**

## Objectives

*   Fetch data from Kaggle
*   Fetch geospatial data
*   Combine both and save.

## Inputs

*  Geospatial data **(to do: pull this dataset with SQL or through an API)**

## Outputs

* Dataset outputs/datasets/collection/WeatherAustralia.csv

## Additional Comments | Insights | Conclusions


* Geospatial data is provided in a ready-to-use format.
  * The raw file had to be engineered separately, because it did not map all the cities in the dataset downloaded from Kaggle (WeatherAUS). That manual task is already done, so you don't have to worry about it.



---

# Install and Import packages

* You may need to restart the runtime after installing packages; check the cell output when you install a package.

In [None]:
# this notebook doesn't need to install/update packages

* If you want to see which packages the session provides

In [None]:
!pip freeze

* It is good practice to restart the runtime (restart the session) after installing new packages, since they may require it. All your variables will be lost.

In [None]:
import os
os.kill(os.getpid(), 9)

# Setup GPU

* The notebook is already set up to use a GPU, but here is a reminder of the process:

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: when you select an option (GPU, TPU or None), you switch between kernels/sessions

---
* How do I know if I am using the GPU?
  * Run the code below. If the output is a device name such as '/device:GPU:0' rather than an empty string, you are using a GPU in this session.


In [1]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [2]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Password: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* Thanks for inserting your credentials!
* You may now Clone your Repo to this Session, then Connect this Session to your Repo.


* **Credentials format disclaimer**: when opening Jupyter notebooks in Colab that are hosted on GitHub, please avoid special characters, such as @, in the password.
  * The remote URL is built by concatenating username:password@github.com/username/repo, so the git push command will not work properly when these terms contain special characters.
  * GitHub no longer accepts account passwords for Git operations over HTTPS, so a personal access token has to be entered in place of the password.
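
One possible way around the special-character restriction is to percent-encode the credentials before they are placed in the remote URL. The sketch below uses `urllib.parse.quote` from the standard library; the `build_remote_url` helper and the sample values are hypothetical and not part of this notebook.

```python
from urllib.parse import quote

def build_remote_url(user: str, secret: str, repo: str) -> str:
    """Return an HTTPS remote URL with percent-encoded credentials.

    Hypothetical helper for illustration; it is not part of this notebook.
    """
    # safe="" forces every reserved character (including "@", ":" and "/") to be encoded
    user_enc = quote(user, safe="")
    secret_enc = quote(secret, safe="")
    return f"https://{user_enc}:{secret_enc}@github.com/{user}/{repo}.git"

# made-up values, only to show the encoding
url = build_remote_url("some-user", "p@ss:word/1", "some-repo")
print(url)
```

With the encoded form, the only literal `@` left in the URL is the one that separates the credentials from the host.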

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [3]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")


Cloning into 'WalkthroughProject'...
remote: Enumerating objects: 773, done.
remote: Counting objects: 100% (329/329), done.
remote: Compressing objects: 100% (273/273), done.
remote: Total 773 (delta 198), reused 82 (delta 39), pack-reused 444
Receiving objects: 100% (773/773), 28.21 MiB | 8.51 MiB/s, done.
Resolving deltas: 100% (424/424), done.


/content/WalkthroughProject


* Current session directory is:/content/WalkthroughProject
* You may refresh the session folder to access WalkthroughProject folder.


---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [4]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

import uuid
file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
with open(f"{file_name}.txt", "w") as file: file.write("text")
print("=== Testing Session Connectivity to the Repo === \n")
! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
print("\n\n")
os.remove(f"{file_name}.txt")
! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (username and password)
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

=== Testing Session Connectivity to the Repo === 

[main 8af64ee] session_connection_test_1b976986-c92b-42c1-9ee3-349865738e10_added_file
 1 file changed, 1 insertion(+)
 create mode 100644 session_connection_test_1b976986-c92b-42c1-9ee3-349865738e10.txt
Counting objects: 3, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 374 bytes | 374.00 KiB/s, done.
Total 3 (delta 1), reused 1 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/FernandoRocha88/WalkthroughProject.git
   1d85f7b..8af64ee  main -> main



[main e1db986] session_connection_test_1b976986-c92b-42c1-9ee3-349865738e10_removed_file
 1 file changed, 1 deletion(-)
 delete mode 100644 session_connection_test_1b976986-c92b-42c1-9ee3-349865738e10.txt
Counting objects: 2, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 270 bytes | 270.00 KiB/s, do

* If the output above indicates an authentication failure, please insert your credentials **(username and password)** again.

---

### **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "update"
! git add .
! git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main

---

### **Delete** Cloned Repo from current Session

In [None]:
%cd /content
!rm -rf {os.environ['RepoName']}
print(f"\n * Please refresh session folder to validate that {os.environ['RepoName']} folder was removed from this session.")
print(f"\n\n* Current session directory is:  {os.getcwd()}")

---

# Fetch data from Kaggle

* Make sure the kaggle package is installed. In a Colab session, it normally is. If it is not, run the following command in a code cell: **! pip install -q kaggle**

In [None]:
pip show kaggle

---

* You first need to download a **json file (authentication token)** from Kaggle to your machine.
* The process is:
  1. From the site header, click on your user profile picture, then on “My Account” in the dropdown menu. This takes you to your account settings. Scroll down to the section of the page labelled API.
  2. Click “Expire API Token” to remove previous tokens.
  3. To create a new token, click the “Create New API Token” button. It generates a fresh authentication token and downloads a kaggle.json file to your machine.
  

* If you run into any difficulty, see the "Authentication" section in this [link](https://www.kaggle.com/docs/api).



* In the end, you should have this file saved locally on your machine. **Please make sure the file is named kaggle.json**


* Upload your kaggle.json file to this Colab session
* Once you run the cell below, click on "Choose Files", then find and select your kaggle.json file

In [None]:
from google.colab import files
files.upload()

import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
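
After the upload, a quick sanity check on the token file can catch a wrong or renamed file early. The sketch below assumes the token is a JSON object with `username` and `key` fields, which is the format Kaggle's API token download uses; the `check_kaggle_token` helper and the demo token are hypothetical.

```python
import json
import os
import tempfile

def check_kaggle_token(path: str) -> bool:
    """Return True if the file parses as JSON and has non-empty 'username' and 'key'."""
    with open(path) as f:
        token = json.load(f)
    return all(bool(token.get(field)) for field in ("username", "key"))

# made-up token written to a temporary folder, only to exercise the check
demo_path = os.path.join(tempfile.mkdtemp(), "kaggle.json")
with open(demo_path, "w") as f:
    json.dump({"username": "some-user", "key": "0123456789abcdef"}, f)

print(check_kaggle_token(demo_path))
```

In the notebook itself, the same function could be pointed at the uploaded `kaggle.json` in the current working directory.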

* Get the dataset path from the Kaggle URL. When you are viewing the dataset on Kaggle, check what comes after https://www.kaggle.com/ and assign it to KaggleDatasetPath.
* Set your destination folder.
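
As a sketch of how the dataset path could be extracted from a URL instead of copied by hand, the snippet below parses the URL with the standard library. The `kaggle_dataset_path` helper is hypothetical, and it assumes the path is `<owner>/<dataset>`, optionally preceded by a `datasets` segment.

```python
from urllib.parse import urlparse

def kaggle_dataset_path(url: str) -> str:
    """Extract the '<owner>/<dataset>' part from a Kaggle dataset URL.

    Hypothetical helper; assumes the first two path segments (after an
    optional 'datasets' segment) identify the dataset.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Could not find '<owner>/<dataset>' in {url!r}")
    return "/".join(parts[:2])

KaggleDatasetPath = kaggle_dataset_path(
    "https://www.kaggle.com/jsphyg/weather-dataset-rattle-package"
)
print(KaggleDatasetPath)
```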

In [None]:
KaggleDatasetPath = "jsphyg/weather-dataset-rattle-package"
DestinationFolder = "inputs/datasets/raw"   
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* Unzip the downloaded file, then delete the zip file and the kaggle.json file

In [None]:
!unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

* Well done! You can now push the changes to your GitHub Repo, using the Git commands (git add, git commit, git push)
* The code for doing that is in the section **"Connection between: Colab Session and your GitHub Repo"**

---

# Load Kaggle data

In [64]:
import pandas as pd
df = pd.read_csv("/content/WalkthroughProject/inputs/datasets/raw/weatherAUS.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

* Renaming 'Rainfall' to 'RainfallToday'

In [65]:
df.rename(mapper={'Rainfall':'RainfallToday'},axis=1,inplace=True)
df.columns

Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'RainfallToday',
       'Evaporation', 'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am',
       'WindDir3pm', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

* Set date as datetime and as index

In [36]:
# df['Date'] = pd.to_datetime(df['Date'])
# df.set_index(df['Date'],drop=True,inplace=True)

* Are all dates in the proper sequence (with no gaps) for each city?

In [70]:
# build the single-city frame first, so that dfCity exists before it is used
dfCity = df.query("Location == 'Albury'").copy()
dfCity['Date'] = pd.to_datetime(dfCity['Date'])
dfCity['Date'].min()

Timestamp('2008-12-01 00:00:00')

In [71]:
pd.date_range(start = dfCity['Date'].min(), end = dfCity['Date'].max() ).difference(df.index)

# dfCity['shift'] = dfCity['Date'].shift(-1)

# dfCity.dropna(subset=['shift'],inplace=True)
# df['shift'] = pd.to_datetime(df['shift'])
#.resample('D').mean()

DatetimeIndex(['2008-12-01', '2008-12-02', '2008-12-03', '2008-12-04',
               '2008-12-05', '2008-12-06', '2008-12-07', '2008-12-08',
               '2008-12-09', '2008-12-10',
               ...
               '2017-06-16', '2017-06-17', '2017-06-18', '2017-06-19',
               '2017-06-20', '2017-06-21', '2017-06-22', '2017-06-23',
               '2017-06-24', '2017-06-25'],
              dtype='datetime64[ns]', length=3129, freq=None)
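
The result above has `length=3129` and runs from the first to the last date, which suggests it is the full calendar range rather than a list of gaps. A likely reason is that `df.index` is still the default `RangeIndex` of row numbers (see the `df.info()` output), so no date is ever found in it. Below is a minimal sketch of a gap check that compares the range against the city's own dates; the `toy` frame and its values are made up.

```python
import pandas as pd

# made-up data: one city with 2020-01-03 missing
toy = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05"],
    "Location": ["Albury"] * 4,
})
toy["Date"] = pd.to_datetime(toy["Date"])

# every calendar day between the first and last observation
full_range = pd.date_range(start=toy["Date"].min(), end=toy["Date"].max())

# days in the calendar range that have no row in the data
missing_days = full_range.difference(pd.DatetimeIndex(toy["Date"]))
print(missing_days)
```

An empty result would mean the city has one row for every day in its range.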

* Add 'RainfallTomorrow'

In [10]:
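def AddRainfallTomorrow(df,categ_var='Location'):
  city_frames = []

  for city in df[categ_var].unique():
    dfCity = df.query(f"{categ_var} == '{city}'").copy()
    # shift within the city's own rows, so the last day of one city
    # does not pick up the first day of the next city
    dfCity['RainfallTomorrow'] = dfCity['RainfallToday'].shift(-1)
    city_frames.append(dfCity)

  # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
  return pd.concat(city_frames)

df = AddRainfallTomorrow(df)
df.head(3)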

Unnamed: 0,Date,Location,MinTemp,MaxTemp,RainfallToday,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,RainfallTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,0.0
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,0.0
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,0.0
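
The per-city loop can also be expressed with `groupby(...).shift(-1)`, which shifts values within each group and leaves the last row of every city as missing. This is a small sketch on made-up data, not the notebook's own method; the `toy` frame is hypothetical and assumes rows are already in date order within each city.

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Albury", "Sydney", "Sydney"],
    "RainfallToday": [0.6, 0.0, 1.2, 5.0, 0.4],
})

# shift(-1) inside each city: tomorrow's rainfall is the next row of the same city
toy["RainfallTomorrow"] = toy.groupby("Location")["RainfallToday"].shift(-1)
print(toy)
```

The last Albury row gets a missing value instead of Sydney's first reading, which is the behaviour wanted at city boundaries.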


# Load spatial data

* The raw data, which maps Australian cities to GPS coordinates and state, was downloaded from: https://simplemaps.com/data/au-cities
* However, this dataset didn't include all the locations present in weatherAUS.csv (such as Uluru, PerthAirport, MelbourneAirport etc.). We added this information for you, so you don't have to worry about it for this project.

In [None]:
df_spatial = pd.read_csv("/content/WalkthroughProject/inputs/datasets/raw/GeospatialAustralia.csv")
df_spatial

* Let's filter the most relevant variables for this project and rename them

In [None]:
df_spatial = (df_spatial
              .filter(['city', 'lat', 'lng', 'admin_name'])
              .rename(mapper={"city":"Location",
                              "lat":"Latitude",
                              "lng":"Longitude",
                              "admin_name":"State"},axis=1)
              )

df_spatial

* Does the spatial dataset cover all cities in the WeatherAUS dataset?

In [None]:
count = 0
# cities that have coordinates, computed once instead of on every iteration
mapped_cities = set(df_spatial['Location'].unique())
for city_df in sorted(df['Location'].unique()):
  if city_df not in mapped_cities:
    count +=1
    print(f"{city_df}")

print(f"\n\n* There are {count} cities that are not mapped \n\n")

# Combining both datasets

In [None]:
df_combination = df.merge(right=df_spatial, how='left',on='Location')
df_combination.head(3)

* Evaluating datasets shape

In [None]:
print(f"* df_combination.shape {df_combination.shape} \n"
      f"* df.shape {df.shape} \n"
      f"* df_spatial.shape {df_spatial.shape}")

* Check if there is missing data in the merged columns

In [None]:
df_combination.filter(['Location','Latitude', 'Longitude','State']).isna().sum()
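
The row-count and missing-value checks can also be pushed into the merge itself. The sketch below uses the `validate` and `indicator` parameters of `DataFrame.merge` on two small made-up frames; `weather`, `spatial` and the coordinate values are hypothetical.

```python
import pandas as pd

# made-up frames standing in for df and df_spatial
weather = pd.DataFrame({"Location": ["Albury", "Albury", "Uluru"],
                        "RainfallToday": [0.6, 0.0, 0.0]})
spatial = pd.DataFrame({"Location": ["Albury"],
                        "Latitude": [-36.0], "Longitude": [147.0]})

merged = weather.merge(
    right=spatial, how="left", on="Location",
    validate="many_to_one",  # raises MergeError if a Location repeats in `spatial`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# a left merge with unique right keys keeps exactly the left row count
same_row_count = len(merged) == len(weather)
# locations with no match in the spatial table
unmapped = merged.loc[merged["_merge"] == "left_only", "Location"].unique().tolist()
print(same_row_count, unmapped)
```

If the spatial table contained a duplicated city, the merge would fail loudly instead of silently duplicating weather rows.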

* Check how many unique cities are in the Kaggle dataset (df) and the geospatial dataset (df_spatial)

In [None]:
print(f"* There are {df['Location'].nunique()} unique cities at df dataset. \n"
      f"* There are {df_spatial['Location'].nunique()} unique cities at df_spatial dataset")

# Saving final dataset and pushing to Repo

In [None]:
import os
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

df_combination.to_csv("/content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv",index=False)
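
An alternative to wrapping `os.makedirs` in `try/except` is to pass `exist_ok=True`, so that re-running the cell does not raise when the folder already exists. This is a minimal sketch in a temporary directory; the `base` path is a stand-in for the repo root and the CSV content is made up.

```python
import os
import tempfile

base = tempfile.mkdtemp()  # stand-in for the repo root
out_dir = os.path.join(base, "outputs", "datasets", "collection")

os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)  # second call is a no-op, no exception

out_file = os.path.join(out_dir, "WeatherAustralia.csv")
with open(out_file, "w") as f:
    f.write("Date,Location\n2008-12-01,Albury\n")
print(os.path.isfile(out_file))
```

Building the file path from the same `out_dir` variable also keeps the folder that is created and the folder that is written to in sync.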

* Well done! You can now push the changes to your GitHub Repo, using the Git commands (git add, git commit, git push)
  * The code for doing that is in the section "Connection between: Colab Session and your GitHub Repo"
* Then, save this notebook to your GitHub repo