# **Data Collection Notebook**

## Objectives

*   Fetch data from Kaggle
*   Upload geospatial data
*   Combine both and save

## Inputs

*  geospatial data

## Outputs

* Combined dataset, ready to be processed: inputs/datasets/collection/WeatherAustralia.csv

## Additional Comments | Insights | Conclusions


* The geospatial data had to be engineered separately, because the original source did not map every city in the main dataset.



---

# Install and Import packages

* After installing a package you may need to restart the runtime; check the cell output of the install command for a prompt to do so.

In [None]:
# ! pip install   xxxx

In [None]:
# Restart the runtime by killing the current process (this restarts the Colab session; all your variables will be lost)
import os
os.kill(os.getpid(), 9)

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be lost.

In [1]:
from getpass import getpass
import os
from IPython.display import clear_output 
print("=== Insert your credentials === \nType in and hit Enter")
UserName = getpass('GitHub User Name: ')
UserEmail = getpass('GitHub User E-mail: ')
RepoName = getpass('GitHub Repository Name: ')
UserPwd = getpass('GitHub Personal Access Token (or Account Password): ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* Thanks for inserting your credentials!
* You may now Clone your Repo to this Session, then Connect this Session to your Repo.


---

### **Clone** your GitHub Repo to your current Colab session

* This gives the session access to your project's files.

In [2]:
! git clone https://github.com/{UserName}/{RepoName}.git

print("\n")
%cd /content/{RepoName}
print(f"\n\n* Current session directory is:  {os.getcwd()}")
print(f"* You may refresh the session folder to access {RepoName} folder.")

Cloning into 'WalkthroughProject1'...
remote: Enumerating objects: 466, done.
remote: Counting objects: 100% (466/466), done.
remote: Compressing objects: 100% (387/387), done.
remote: Total 466 (delta 244), reused 130 (delta 30), pack-reused 0
Receiving objects: 100% (466/466), 16.31 MiB | 5.77 MiB/s, done.
Resolving deltas: 100% (244/244), done.


/content/WalkthroughProject1


* Current session directory is:  /content/WalkthroughProject1
* You may refresh the session folder to access WalkthroughProject1 folder.


---

### **Connect** this Colab session to your GitHub Repo

* This lets you push files generated in this session to your Repo when needed.

In [3]:
!git config --global user.email {UserEmail}
!git config --global user.name {UserName}
!git remote rm origin
!git remote add origin https://{UserName}:{UserPwd}@github.com/{UserName}/{RepoName}.git
print(f"\n\n * The current Colab Session is connected to the following GitHub repo: {UserName}/{RepoName}")
print(" * You can now push new files to the repo.")



 * The current Colab Session is connected to the following GitHub repo: FernandoRocha88/WalkthroughProject1
 * You can now push new files to the repo.
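
If a user name or secret contains characters such as `@` or `/`, the remote URL built in the cell above can be misparsed. A minimal sketch of one way to guard against that, assuming percent-encoding with the standard library; `build_remote_url` and the placeholder values are hypothetical and not part of the original notebook:

```python
from urllib.parse import quote


def build_remote_url(user_name: str, secret: str, repo_name: str) -> str:
    """Return an HTTPS remote URL with percent-encoded credentials."""
    # safe="" makes quote() escape "/" as well, so no reserved character survives
    user = quote(user_name, safe="")
    token = quote(secret, safe="")
    return f"https://{user}:{token}@github.com/{user}/{repo_name}.git"


# Placeholder values for illustration only (not real credentials)
example_url = build_remote_url("octocat", "p@ss/word", "demo-repo")
print(example_url)
```

The resulting string could then be passed to `git remote add origin` in place of the hand-built URL.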


---

### **Push** generated/new files from this Session to GitHub repo

* Git commit

In [25]:
CommitMsg = "added-dataset"
!git add .
!git commit -m {CommitMsg}

[main 8918072] added-dataset
 3 files changed, 0 insertions(+), 0 deletions(-)
 rename inputs/datasets/{WeatherAustralia_raw.csv => collection/WeatherAustralia.csv} (100%)
 rename inputs/datasets/{ => raw}/GeospatialAustralia.csv (100%)
 rename inputs/datasets/{ => raw}/weatherAUS.csv (100%)
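
The commit message in the cell above contains no spaces. Because `{CommitMsg}` is interpolated into the shell command without quotes, a message with spaces would be split into several arguments. A small sketch of one possible workaround, assuming the message is quoted with `shlex.quote` before interpolation; the names `commit_msg`, `quoted_msg` and `command` are illustrative:

```python
import shlex

# A message with spaces, for illustration
commit_msg = "added dataset and geospatial data"

# shlex.quote wraps the text so a POSIX shell treats it as a single argument
quoted_msg = shlex.quote(commit_msg)
command = f"git commit -m {quoted_msg}"
print(command)

# Round trip: splitting the command like a shell would recovers the message intact
recovered_args = shlex.split(command)
```

In a notebook cell, the quoted value could then be used as `!git commit -m {quoted_msg}`.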


* Git Push

In [26]:
!git push origin main

Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), 498 bytes | 498.00 KiB/s, done.
Total 6 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/FernandoRocha88/WalkthroughProject1.git
   1d468cf..8918072  main -> main


---

### **Delete** Cloned Repo from current Session

In [None]:
%cd /content
!rm -rf {RepoName}
print(f"\n * Please refresh session folder to validate that {RepoName} folder was removed from this session.")

---

# Fetch data from Kaggle

* Make sure the kaggle package is installed. In a Colab session it normally is. If it is not, run the following command in a code cell: **! pip install -q kaggle**

In [None]:
pip show kaggle

---

* You first need to download a **json file (authentication token)** from Kaggle to your machine.
* The process is:
  1. From the site header, click on your user profile picture, then on “My Account” in the dropdown menu. This takes you to your account settings. Scroll down to the section of the page labelled API.
  2. Click “Expire API Token” to remove previous tokens.
  3. To create a new token, click the “Create New API Token” button. It generates a fresh authentication token and downloads a kaggle.json file to your machine.
  

* If you run into any difficulty, see the "Authentication" section at this [link](https://www.kaggle.com/docs/api).



* At the end, you should have this file saved locally on your machine. **Please make sure the file is named kaggle.json**


* Upload your kaggle.json file to this Colab session.
* Once you run the cell below, click on "Choose Files", then find your kaggle.json file and select it.

In [None]:
from google.colab import files
files.upload()

import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
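
Before calling the Kaggle CLI, it can help to confirm that the uploaded file is well-formed. A minimal sketch, assuming the token file is a JSON object with `username` and `key` fields; `check_kaggle_token` is a hypothetical helper, and the demo uses a throwaway file with dummy values rather than a real token:

```python
import json
import os
import stat
import tempfile


def check_kaggle_token(path: str) -> bool:
    """Return True if the file parses as JSON and has both expected fields."""
    with open(path) as handle:
        token = json.load(handle)
    return {"username", "key"} <= set(token)


# Demonstration with a temporary file and dummy values (not a real token)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "kaggle.json")
with open(demo_path, "w") as handle:
    json.dump({"username": "demo-user", "key": "0000"}, handle)

# Python equivalent of `chmod 600 kaggle.json`: owner read/write only
os.chmod(demo_path, 0o600)

token_ok = check_kaggle_token(demo_path)
file_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
print(token_ok, oct(file_mode))
```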

* Get the dataset path from the Kaggle URL. When viewing the dataset on Kaggle, copy the part of the URL that comes after https://www.kaggle.com/ and assign it to KaggleDatasetPath.
* Set your destination folder.

In [None]:
KaggleDatasetPath = "jsphyg/weather-dataset-rattle-package"
DestinationFolder = "inputs/datasets/raw"   
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
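
The dataset path can also be derived from the URL programmatically. A rough sketch, assuming the path has the form `owner/dataset-name` and that some Kaggle URLs carry a leading `datasets/` segment that needs dropping; `dataset_path_from_url` is a hypothetical helper, not a Kaggle API function:

```python
from urllib.parse import urlparse


def dataset_path_from_url(url: str) -> str:
    """Extract 'owner/dataset-name' from a Kaggle dataset URL."""
    # Split the URL path into its non-empty segments
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Assumption: a leading "datasets" segment is not part of the dataset path
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    # Keep only the owner and the dataset name
    return "/".join(parts[:2])


short_form = dataset_path_from_url(
    "https://www.kaggle.com/jsphyg/weather-dataset-rattle-package")
long_form = dataset_path_from_url(
    "https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package")
print(short_form)
print(long_form)
```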

* Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [None]:
!unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
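
The same unzip-and-clean-up step can be written in Python instead of shell commands. A minimal sketch using the standard `zipfile` module; `extract_and_remove_zips` is a hypothetical helper, and the demo builds a tiny throwaway archive rather than using the real Kaggle download:

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip in `folder` into it, delete the archives, return member names."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been closed
        zip_path.unlink()
    return extracted


# Demonstration with a throwaway archive holding dummy content
demo_folder = tempfile.mkdtemp()
with zipfile.ZipFile(Path(demo_folder) / "demo.zip", "w") as archive:
    archive.writestr("weatherAUS.csv", "Date,Location\n")

extracted_files = extract_and_remove_zips(demo_folder)
remaining_zips = list(Path(demo_folder).glob("*.zip"))
print(extracted_files, remaining_zips)
```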

* Well done! You can now push the changes to your GitHub Repo using the Git commands (git add, git commit, git push).
* The code for that is in the section **"Connection between: Colab Session and your GitHub Repo"**

---

# Load Weather data

In [16]:
import pandas as pd
df = pd.read_csv("/content/WalkthroughProject1/inputs/datasets/raw/weatherAUS.csv")
df.shape

(145460, 23)

# Get spatial data

* The raw data, which maps Australian cities to GPS coordinates and state, was downloaded from: https://simplemaps.com/data/au-cities
* However, this dataset didn't have all the locations present in weatherAUS.csv (such as Uluru, PerthAirport and MelbourneAirport). We added this information for you, so you don't have to worry about it in this project.

In [18]:
df_spatial = pd.read_csv("/content/WalkthroughProject1/inputs/datasets/raw/GeospatialAustralia.csv")
df_spatial

Unnamed: 0,city,lat,lng,country,iso2,admin_name,capital,population,population_proper
0,Aberdare,-32.842800,151.380300,Australia,AU,New South Wales,,2473,2473
1,Aberdeen,-32.165000,150.901100,Australia,AU,New South Wales,,2084,2084
2,Abermain,-32.807200,151.427500,Australia,AU,New South Wales,,2337,2337
3,Adelaide,-34.928900,138.601100,Australia,AU,South Australia,admin,1345777,1295714
4,Agnes Water,-24.212500,151.903200,Australia,AU,Queensland,,2210,2210
...,...,...,...,...,...,...,...,...,...
1046,Walpole,-34.970000,116.728000,Australia,AU,Western Australia,,0,0
1047,Watsonia,-37.711000,145.083000,Australia,AU,Victoria,,0,0
1048,Williamtown,-32.808997,151.838997,Australia,AU,New South Wales,,0,0
1049,Witchcliffe,-34.026000,115.100000,Australia,AU,Western Australia,,0,0


* Let's keep the most relevant variables and rename them

In [19]:
df_spatial = (df_spatial
              .filter(['city', 'lat', 'lng', 'admin_name'])
              .rename(mapper={
                              "city":"Location",
                              "lat":"Latitude",
                              "lng":"Longitude",
                              "admin_name":"State"},
                      axis=1)
              )

df_spatial

Unnamed: 0,Location,Latitude,Longitude,State
0,Aberdare,-32.842800,151.380300,New South Wales
1,Aberdeen,-32.165000,150.901100,New South Wales
2,Abermain,-32.807200,151.427500,New South Wales
3,Adelaide,-34.928900,138.601100,South Australia
4,Agnes Water,-24.212500,151.903200,Queensland
...,...,...,...,...
1046,Walpole,-34.970000,116.728000,Western Australia
1047,Watsonia,-37.711000,145.083000,Victoria
1048,Williamtown,-32.808997,151.838997,New South Wales
1049,Witchcliffe,-34.026000,115.100000,Western Australia


* Does the spatial dataset cover all cities in the weatherAUS dataset?

In [20]:
# Cities present in the weather data but missing from the spatial data
spatial_cities = set(df_spatial['Location'].unique())
unmapped_cities = [city_df for city_df in df['Location'].sort_values().unique()
                   if city_df not in spatial_cities]
for city_df in unmapped_cities:
  print(f"{city_df}")

print(f"\n\n* There are {len(unmapped_cities)} cities that are not mapped \n\n")




* There are 0 cities that are not mapped 
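
The coverage check above can also be expressed with pandas `isin`. A small sketch on made-up toy frames (the rows below are illustrative and not taken from the real datasets):

```python
import pandas as pd

# Toy stand-ins for df and df_spatial (illustrative values only)
weather = pd.DataFrame({"Location": ["Albury", "Uluru", "Albury", "PerthAirport"]})
spatial = pd.DataFrame({"Location": ["Albury", "Adelaide"]})

# Unique weather locations that have no match in the spatial table
weather_locations = pd.Series(weather["Location"].unique())
unmapped = (weather_locations[~weather_locations.isin(spatial["Location"])]
            .sort_values()
            .tolist())
print(unmapped)
```

An empty list would correspond to the "0 cities that are not mapped" result reported above.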




# Combining both datasets

In [21]:
df_combination = df.merge(right=df_spatial, how='left',on='Location')
df_combination.head(3)

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Latitude,Longitude,State
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No,-36.0806,146.9158,New South Wales
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No,-36.0806,146.9158,New South Wales
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No,-36.0806,146.9158,New South Wales
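
A left merge keeps every weather row, but it silently leaves missing values for unmatched locations and duplicates rows when the lookup key repeats. A hedged sketch of how the `indicator` and `validate` options of `DataFrame.merge` can surface both problems, shown on made-up toy frames rather than the real data:

```python
import pandas as pd

# Toy stand-ins for df and df_spatial (illustrative values only)
weather = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-02", "2008-12-01"],
    "Location": ["Albury", "Albury", "Uluru"],
})
spatial = pd.DataFrame({
    "Location": ["Albury", "Adelaide"],
    "State": ["New South Wales", "South Australia"],
})

merged = weather.merge(
    spatial,
    how="left",
    on="Location",
    validate="many_to_one",  # raises MergeError if a Location repeats in `spatial`
    indicator=True,          # adds a `_merge` column recording where each row matched
)

# Rows marked "left_only" found no match in the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "Location"].unique().tolist()
print(len(merged), unmatched)
```

Note that `validate="many_to_one"` requires unique keys in the whole right-hand table, so a lookup table with repeated names would need de-duplicating first.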


* Evaluating the shape of each dataset

In [22]:
print(f"* df_combination.shape {df_combination.shape} \n"
      f"* df.shape {df.shape} \n"
      f"* df_spatial.shape {df_spatial.shape}")

* df_combination.shape (145460, 26) 
* df.shape (145460, 23) 
* df_spatial.shape (1051, 4)


In [23]:
print(f"* There are {df['Location'].nunique()} unique cities at df dataset. \n"
      f"* There are {df_spatial['Location'].nunique()} unique cities at df_spatial dataset")

* There are 49 unique cities at df dataset. 
* There are 1044 unique cities at df_spatial dataset
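
The counts above show 1051 rows but 1044 unique names in df_spatial, so some city names appear more than once. A minimal sketch of how such repeats could be listed, using a made-up toy frame (the rows are illustrative, not taken from the real file):

```python
import pandas as pd

# Toy stand-in for df_spatial with one repeated name (illustrative values only)
spatial = pd.DataFrame({
    "Location": ["Richmond", "Richmond", "Albury"],
    "State": ["New South Wales", "Victoria", "New South Wales"],
})

# keep=False flags every occurrence of a repeated name, not just the later ones
duplicates = spatial[spatial["Location"].duplicated(keep=False)]

n_rows = len(spatial)
n_unique = spatial["Location"].nunique()
print(duplicates)
print(f"{n_rows} rows, {n_unique} unique names")
```

Because the combined dataset keeps the same number of rows as df, none of the weather locations appear to be among the repeated names.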


# Saving final dataset and pushing to Repo

In [24]:
df_combination.to_csv("/content/WalkthroughProject1/inputs/datasets/collection/WeatherAustralia.csv",index=False)
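
`to_csv` fails if the destination folder does not exist yet. A small sketch of one way to guard against that, assuming the folder should be created on demand; `save_csv` is a hypothetical helper and the demo writes to a temporary directory rather than to the repo:

```python
import os
import tempfile

import pandas as pd


def save_csv(frame: pd.DataFrame, path: str) -> None:
    """Create the parent folder if needed, then write the frame without its index."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    frame.to_csv(path, index=False)


# Demonstration with a toy frame and a throwaway destination
demo = pd.DataFrame({"Location": ["Albury"], "State": ["New South Wales"]})
out_path = os.path.join(tempfile.mkdtemp(), "collection", "demo.csv")
save_csv(demo, out_path)

reloaded = pd.read_csv(out_path)
print(reloaded)
```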

* Well done! You can now push the changes to your GitHub Repo using the Git commands (git add, git commit, git push).
* The code for that is in the section "Connection between: Colab Session and your GitHub Repo"