# Local to GCS Data Pipeline: Reading, Cleaning, and Uploading with Pandas

## Step 1: Read the data from a local file into a pandas DataFrame.

#### Importing the basic libraries needed for data loading

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(r"C:\Users\sushe\Downloads\WHO_COVID19_globaldata.csv", delimiter=';', index_col=False)

In [3]:
df.head()

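|   | Date_reported | Country_code | Country | WHO_region | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths, |
|---|---|---|---|---|---|---|---|---|
| 0 | 05/01/2020 | AF | Afghanistan | EMRO | NaN | 0 | NaN | 0 |
| 1 | 12/01/2020 | AF | Afghanistan | EMRO | NaN | 0 | NaN | 0 |
| 2 | 19/01/2020 | AF | Afghanistan | EMRO | NaN | 0 | NaN | 0 |
| 3 | 26/01/2020 | AF | Afghanistan | EMRO | NaN | 0 | NaN | 0 |
| 4 | 02/02/2020 | AF | Afghanistan | EMRO | NaN | 0 | NaN | 0 |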
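As a minimal sketch of what `read_csv` does with these options, the snippet below parses a small in-memory sample instead of the real file. The sample rows and the name `sample_df` are hypothetical, and the stray comma at the end of each line is an assumption inferred from the `Cumulative_deaths,` header above.

```python
import io

import pandas as pd

# Hypothetical sample shaped like the WHO file: semicolon-delimited,
# with a stray trailing comma at the end of every line (an assumption).
sample = (
    "Date_reported;Country_code;Country;WHO_region;New_cases;"
    "Cumulative_cases;New_deaths;Cumulative_deaths,\n"
    "05/01/2020;AF;Afghanistan;EMRO;;0;;0,\n"
    "12/01/2020;AF;Afghanistan;EMRO;;0;;0,\n"
)

# Same options as the cell above, applied to the in-memory sample.
sample_df = pd.read_csv(io.StringIO(sample), delimiter=";", index_col=False)

print(sample_df.columns.tolist())
print(sample_df.dtypes)
```

Under that assumption, the comma ends up inside both the last column's name and its values, which is why the next step has to strip it.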


## Step 2: Data Cleaning

In [4]:
df.columns = df.columns.str.strip()
df.iloc[:, -1] = df.iloc[:, -1].astype(str).str.rstrip(',')

In [5]:
df = df.rename(columns={'Cumulative_deaths,': 'Cumulative_deaths'})

print(df.head())

  Date_reported Country_code      Country WHO_region  New_cases  \
0    05/01/2020           AF  Afghanistan       EMRO        NaN   
1    12/01/2020           AF  Afghanistan       EMRO        NaN   
2    19/01/2020           AF  Afghanistan       EMRO        NaN   
3    26/01/2020           AF  Afghanistan       EMRO        NaN   
4    02/02/2020           AF  Afghanistan       EMRO        NaN   

   Cumulative_cases  New_deaths Cumulative_deaths  
0                 0         NaN                 0  
1                 0         NaN                 0  
2                 0         NaN                 0  
3                 0         NaN                 0  
4                 0         NaN                 0  


In [6]:
df.fillna(0, inplace=True) 
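Filling every column with 0 is simple, but it can help to check how many values are missing first and to limit the fill to the count columns. The following is a sketch on a made-up frame (`demo` and `count_cols` are hypothetical names), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same kind of gaps as the WHO data.
demo = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "New_cases": [np.nan, 5.0, np.nan],
    "New_deaths": [np.nan, np.nan, 1.0],
})

# Count missing values per column before deciding how to fill them.
missing_before = demo.isna().sum()
print(missing_before)

# Fill only the numeric count columns, leaving other columns untouched.
count_cols = ["New_cases", "New_deaths"]
demo[count_cols] = demo[count_cols].fillna(0)
```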

In [7]:
df = df.drop(columns=['New_cases'], errors='ignore')  

In [8]:
df.head()

|   | Date_reported | Country_code | Country | WHO_region | Cumulative_cases | New_deaths | Cumulative_deaths |
|---|---|---|---|---|---|---|---|
| 0 | 05/01/2020 | AF | Afghanistan | EMRO | 0 | 0.0 | 0 |
| 1 | 12/01/2020 | AF | Afghanistan | EMRO | 0 | 0.0 | 0 |
| 2 | 19/01/2020 | AF | Afghanistan | EMRO | 0 | 0.0 | 0 |
| 3 | 26/01/2020 | AF | Afghanistan | EMRO | 0 | 0.0 | 0 |
| 4 | 02/02/2020 | AF | Afghanistan | EMRO | 0 | 0.0 | 0 |


In [9]:
df.tail()

|   | Date_reported | Country_code | Country | WHO_region | Cumulative_cases | New_deaths | Cumulative_deaths |
|---|---|---|---|---|---|---|---|
| 57835 | 14/07/2024 | ZW | Zimbabwe | AFRO | 266385 | 0.0 | 5740 |
| 57836 | 21/07/2024 | ZW | Zimbabwe | AFRO | 266386 | 0.0 | 5740 |
| 57837 | 28/07/2024 | ZW | Zimbabwe | AFRO | 266386 | 0.0 | 5740 |
| 57838 | 04/08/2024 | ZW | Zimbabwe | AFRO | 266386 | 0.0 | 5740 |
| 57839 | 11/08/2024 | ZW | Zimbabwe | AFRO | 266386 | 0.0 | 5740 |


In [10]:
df.dtypes

Date_reported         object
Country_code          object
Country               object
WHO_region            object
Cumulative_cases       int64
New_deaths           float64
Cumulative_deaths     object
dtype: object
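The dtypes above show that `Date_reported` and `Cumulative_deaths` are still stored as strings (`object`). If typed columns are wanted downstream, a conversion along these lines might be added; this is a sketch on hypothetical rows (the frame `typed` is not part of the notebook), assuming the dates are day-first as the weekly sequence in the output suggests.

```python
import pandas as pd

# Hypothetical rows mirroring the dtypes shown above: dates and
# Cumulative_deaths are still stored as strings.
typed = pd.DataFrame({
    "Date_reported": ["05/01/2020", "12/01/2020", "14/07/2024"],
    "Cumulative_deaths": ["0", "0", "5740"],
})

# The dates are day-first (dd/mm/yyyy), so state the format explicitly.
typed["Date_reported"] = pd.to_datetime(typed["Date_reported"], format="%d/%m/%Y")

# Convert the string counts to integers; by default to_numeric raises
# on any value that is not a clean number.
typed["Cumulative_deaths"] = pd.to_numeric(typed["Cumulative_deaths"]).astype("int64")

print(typed.dtypes)
```

Giving the format explicitly avoids `12/01/2020` being read as December 1 instead of January 12.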

In [11]:
df.shape

(57840, 7)

In [12]:
df = df.drop(columns=['WHO_region', 'Country_code'], errors='ignore')
print(df.head())

  Date_reported      Country  Cumulative_cases  New_deaths Cumulative_deaths
0    05/01/2020  Afghanistan                 0         0.0                 0
1    12/01/2020  Afghanistan                 0         0.0                 0
2    19/01/2020  Afghanistan                 0         0.0                 0
3    26/01/2020  Afghanistan                 0         0.0                 0
4    02/02/2020  Afghanistan                 0         0.0                 0


In [13]:
df.head()

|   | Date_reported | Country | Cumulative_cases | New_deaths | Cumulative_deaths |
|---|---|---|---|---|---|
| 0 | 05/01/2020 | Afghanistan | 0 | 0.0 | 0 |
| 1 | 12/01/2020 | Afghanistan | 0 | 0.0 | 0 |
| 2 | 19/01/2020 | Afghanistan | 0 | 0.0 | 0 |
| 3 | 26/01/2020 | Afghanistan | 0 | 0.0 | 0 |
| 4 | 02/02/2020 | Afghanistan | 0 | 0.0 | 0 |


## Step 3: Saving the cleaned data to a local file.

In [14]:
df.to_csv(r'C:\Users\sushe\Documents\HarshaProjectsDE\GCP-ETL-Project\gcs-to-bigquery-etl-pipeline\Dataset\CleanedData.csv', index=False)
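The absolute Windows path above only works on one machine. One way to make the save step more portable is to build the path with `pathlib` and read the file back as a quick check. The sketch below uses a temporary directory as a stand-in for the project's `Dataset` folder; `cleaned`, `out_path`, and `round_trip` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
cleaned = pd.DataFrame({
    "Date_reported": ["05/01/2020"],
    "Country": ["Afghanistan"],
    "Cumulative_cases": [0],
})

# A temporary directory stands in for the project's Dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "Dataset" / "CleanedData.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    cleaned.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip preserved the data.
    round_trip = pd.read_csv(out_path)

print(round_trip)
```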

## Step 4: Upload the cleaned file to GCS.

In [15]:
!pip install google-cloud-storage
!pip install google-auth



In [16]:
!gcloud auth login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=6jbKYYa5OuZuWiLSz8ttG7XMELfVr2&access_type=offline&code_challenge=J4wffX-DSKuWBoVDGRHV_j323ukPCoebgY37CEGqKWo&code_challenge_method=S256


You are now logged in as [harshavardhan03467@gmail.com].
Your current project is [caramel-duality-452616-e6].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [17]:
from google.cloud import storage
import os
storage_client = storage.Client()

#### A bucket named **covid19data202024** was created in GCS, and the dataset **WHO_COVID19_globaldata.csv** was loaded into it

In [18]:
bucket_name = "covid19data202024"

In [19]:
local_file_path = r"C:\Users\sushe\Documents\HarshaProjectsDE\GCP-ETL-Project\gcs-to-bigquery-etl-pipeline\Dataset\CleanedData.csv"

In [20]:
destination_blob_name = "WHO_COVID19_global_Clean_data.csv"

In [21]:
bucket = storage_client.bucket(bucket_name)

In [22]:
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(local_file_path)

In [23]:
print(f"File {local_file_path} uploaded to {destination_blob_name}.")

File C:\Users\sushe\Documents\HarshaProjectsDE\GCP-ETL-Project\gcs-to-bigquery-etl-pipeline\Dataset\CleanedData.csv uploaded to WHO_COVID19_global_Clean_data.csv.
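The upload cells above can be collected into one reusable function. The sketch below is one possible shape for it, not an official API: `upload_file` is a hypothetical helper that takes any object behaving like `google.cloud.storage.Client` and relies only on the same three calls used above (`bucket`, `blob`, `upload_from_filename`).

```python
from pathlib import Path


def upload_file(client, bucket_name, local_path, blob_name):
    """Upload local_path to gs://bucket_name/blob_name and return that URI.

    `client` is expected to behave like google.cloud.storage.Client.
    """
    path = Path(local_path)
    if not path.is_file():
        # Fail early with a clear message instead of a mid-upload error.
        raise FileNotFoundError(f"No such file: {path}")

    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(str(path))
    return f"gs://{bucket_name}/{blob_name}"


# Usage with the real client (requires credentials and network access):
#   from google.cloud import storage
#   uri = upload_file(storage.Client(), "covid19data202024",
#                     local_file_path, "WHO_COVID19_global_Clean_data.csv")
```

Called with no arguments, `storage.Client()` looks up Application Default Credentials. If it cannot find any on a local machine, running `gcloud auth application-default login` is the usual way to create them.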


#### Finally, the data has been loaded to GCS. We could upload the data from local storage to GCS directly using the Upload file option. However, I wanted to try it using Python inside a Jupyter notebook

The motive behind loading data from local storage to GCS in this project is to **automate the process of securely backing up and centralizing data in the cloud**. This keeps the data safely stored, accessible from anywhere, and ready for scalable processing or analysis. It also reduces the risk of losing data that is held only on local storage and makes it easier to share the data or integrate it with other cloud-based services and applications.