#1 Data cleaning. 

First we work on "data_archive/FlunetData_India_All Sites_for_09 September 1996 to 09 September 2024.csv", which contains the following columns:

- Country area or territory: This column indicates the country or region from which the data was collected. For example, "India" refers to data collected in India.

- Surveillance site type: This specifies the type of surveillance site where the data was collected. For example, "Non-sentinel" refers to data collected from sites that are not part of a pre-selected sentinel network but may include hospitals or labs.

- Year-week (ISO 8601 calendar): This follows the ISO 8601 standard for representing the year and week number. For example, "2021-32" represents the 32nd week of the year 2021. The ISO 8601 standard is used to ensure consistency in date formatting across countries.

- Week start date (ISO 8601 calendar): This column contains the start date of the week in the ISO 8601 format YYYY-MM-DD (year-month-day). ISO 8601 weeks start on Monday, so this is the Monday of that week.

- Specimen tested: This column shows the total number of specimens that were tested for influenza in that week or location.

- Influenza positive: This indicates how many of the tested specimens were positive for influenza, meaning the virus was detected.

- Influenza negative: This column displays the number of specimens that tested negative for influenza, meaning the virus was not detected.

- A (H1): This refers to influenza A subtype H1, i.e. seasonal A(H1N1) viruses, which are counted separately from the 2009 pandemic strain listed next.

- A (H1N1)pdm09: This refers to the specific strain of influenza A (H1N1) responsible for the 2009 pandemic, commonly known as the swine flu.

- A (H3): This refers to the H3 subtype of influenza A, which is another common strain responsible for seasonal flu outbreaks.

- A not subtyped: This column shows how many influenza A samples were detected but not subtyped into a specific strain (like H1 or H3).

- B (Victoria): This refers to the Victoria lineage of influenza B, which is one of the two main lineages of the influenza B virus.

- B (Yamagata): This refers to the Yamagata lineage of influenza B, the other main lineage of the influenza B virus.

- B (lineage not determined): This shows how many influenza B samples were detected but could not be determined as belonging to either the Victoria or Yamagata lineages.
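
The relation between the "Year-week" and "Week start date" columns can be sketched with the standard library. This is only an illustration: the helper name `iso_week_start` is made up here and is not part of the dataset or of pandas. It assumes the year-week string always has the form "YYYY-WW".

```python
from datetime import date


def iso_week_start(year_week: str) -> date:
    """Return the Monday that starts an ISO 8601 week given as 'YYYY-WW'."""
    year, week = (int(part) for part in year_week.split("-"))
    # In the ISO calendar, weekday 1 is Monday
    return date.fromisocalendar(year, week, 1)


print(iso_week_start("2021-32"))  # 2021-08-09
print(iso_week_start("2024-36"))  # 2024-09-02
```

`date.fromisocalendar` requires Python 3.8 or newer.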

In [28]:
import pandas as pd 

infulenza_data = pd.read_csv("data_archive/FlunetData_India_All Sites_for_09 September 1996 to 09 September 2024.csv")
infulenza_data['Week start date (ISO 8601 calendar)'] = pd.to_datetime(infulenza_data['Week start date (ISO 8601 calendar)'])

infulenza_data['Year'] = infulenza_data['Week start date (ISO 8601 calendar)'].dt.year
year_counts = infulenza_data['Year'].value_counts().sort_index()

print("Years and counts (how many datarows within the year):")
print(year_counts)



Years and counts (how many datarows within the year):
Year
1998      1
1999     14
2000     44
2001      7
2002     47
2003     23
2007      1
2008     45
2009     40
2010     50
2011     52
2012     53
2013     52
2014     52
2015     52
2016     48
2017     48
2018     51
2019     52
2020     24
2021     74
2022     99
2023    102
2024     72
Name: count, dtype: int64


In [29]:
# Convert 'Week start date (ISO 8601 calendar)' column to date format
# (already done in the previous cell; repeated here so the cell works on its own)
infulenza_data['Week start date (ISO 8601 calendar)'] = pd.to_datetime(infulenza_data['Week start date (ISO 8601 calendar)'])

# Sort the DataFrame by date in ascending order
infulenza_data_sorted = infulenza_data.sort_values(by='Week start date (ISO 8601 calendar)', ascending=True)

# Display the sorted DataFrame
infulenza_data_sorted

Unnamed: 0.1,Unnamed: 0,Country area or territory,Surveillance site type,Year-week (ISO 8601 calendar),Week start date (ISO 8601 calendar),Specimen tested,Influenza positive,Influenza negative,A (H1),A (H1N1)pdm09,A (H3),A not subtyped,B (Victoria),B (Yamagata),B (lineage not determined),Year
144,145,India,Not defined,1998-31,1998-07-27,1,1,0,,,1.0,,,,,1998
145,146,India,Not defined,1999-38,1999-09-20,0,0,0,,,,0.0,,,0.0,1999
146,147,India,Not defined,1999-39,1999-09-27,0,0,0,,,,0.0,,,0.0,1999
147,148,India,Not defined,1999-40,1999-10-04,0,0,0,,,,0.0,,,0.0,1999
148,149,India,Not defined,1999-41,1999-10-11,0,0,0,,,,0.0,,,0.0,1999
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1100,1101,India,Sentinel,2024-35,2024-08-26,264,50,214,,44.0,5.0,,1.0,,,2024
1101,1102,India,Sentinel,2024-36,2024-09-02,195,33,162,,22.0,7.0,,3.0,,1.0,2024
142,143,India,Non-sentinel,2024-36,2024-09-02,49,8,41,,6.0,2.0,,,,,2024
143,144,India,Non-sentinel,2024-37,2024-09-09,66,13,53,,8.0,5.0,,,,,2024
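
The tail of the sorted table lists week 2024-36 twice, once per surveillance site type. That is probably why recent years have more than 53 rows in the yearly counts. A small sketch to check this, using toy rows copied from the output above (the names `rows`, `per_site` and `shared_weeks` are illustrative):

```python
import pandas as pd

# Toy rows copied from the tail of the sorted table
rows = pd.DataFrame({
    "Surveillance site type": ["Sentinel", "Sentinel", "Non-sentinel", "Non-sentinel"],
    "Year-week (ISO 8601 calendar)": ["2024-35", "2024-36", "2024-36", "2024-37"],
    "Year": [2024, 2024, 2024, 2024],
})

# Number of rows per year and surveillance site type
per_site = pd.crosstab(rows["Year"], rows["Surveillance site type"])
print(per_site)

# Weeks that are reported by more than one site type
site_types_per_week = rows.groupby("Year-week (ISO 8601 calendar)")["Surveillance site type"].nunique()
shared_weeks = site_types_per_week[site_types_per_week > 1].index.tolist()
print(shared_weeks)  # ['2024-36']
```

The same two statements should work on `infulenza_data` directly, since it has the same column names.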


The data is ready to be cleaned. Cleaning steps:

- Remove NaN values
- Drop columns that are not important
- Replace missing values with column means
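
One possible way to carry out these steps is sketched below on a toy frame. Several choices are assumptions and not decided in this notebook: the "not important" columns are taken to be the leftover `Unnamed` index columns, rows are dropped only when `Specimen tested` is missing, and all remaining numeric gaps are filled with the column mean. The names `toy` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the FluNet table (values are illustrative, not real data)
toy = pd.DataFrame({
    "Unnamed: 0": [143, 144, 145, 146],
    "Specimen tested": [49.0, 66.0, np.nan, 10.0],
    "Influenza positive": [8.0, 13.0, 1.0, 2.0],
    "A (H3)": [2.0, np.nan, 1.0, 6.0],
})

# 1. Drop columns that are not important (assumed: leftover index columns)
cleaned = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]

# 2. Remove rows with NaN in a key column (assumed: 'Specimen tested')
cleaned = cleaned.dropna(subset=["Specimen tested"])

# 3. Replace the remaining missing numeric values with the column mean
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned = cleaned.fillna(cleaned[numeric_cols].mean())

print(cleaned)
```

Whether a mean is a sensible fill value for subtype counts depends on what a missing entry means in the source data, so the third step may need a different rule per column.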

In [None]:
#code here


The second data source consists of nine CSV files: weather data for eight cities in India plus one file with station geolocations.

In [30]:
# Files that share the same format
BangaloreCity = "data_archive/Temperature_And_Precipitation_Cities_IN/Bangalore_1990_2022_BangaloreCity.csv"
Madras = "data_archive/Temperature_And_Precipitation_Cities_IN/Chennai_1990_2022_Madras.csv"
Safdarjung = "data_archive/Temperature_And_Precipitation_Cities_IN/Delhi_NCR_1990_2022_Safdarjung.csv"
Lucknow = "data_archive/Temperature_And_Precipitation_Cities_IN/Lucknow_1990_2022.csv"
Mumbai = "data_archive/Temperature_And_Precipitation_Cities_IN/Mumbai_1990_2022_Santacruz.csv"
Rajasthan = "data_archive/Temperature_And_Precipitation_Cities_IN/Rajasthan_1990_2022_Jodhpur.csv"

# Files with a different format from the others
Bhubhneshwar = "data_archive/Temperature_And_Precipitation_Cities_IN/weather_Bhubhneshwar_1990_2022.csv"
Rourkela = "data_archive/Temperature_And_Precipitation_Cities_IN/weather_Rourkela_2021_2022.csv"

# Station coordinates and elevation
Geolocations = "data_archive/Temperature_And_Precipitation_Cities_IN/Station_GeoLocation_Longitute_Latitude_Elevation_EPSG_4326.csv"

Here I concatenate the files that share the same format into a single DataFrame.

In [31]:
# Read each CSV file as a separate DataFrame
BangaloreCity = pd.read_csv("data_archive/Temperature_And_Precipitation_Cities_IN/Bangalore_1990_2022_BangaloreCity.csv")
Madras = pd.read_csv("data_archive/Temperature_And_Precipitation_Cities_IN/Chennai_1990_2022_Madras.csv")
Safdarjung = pd.read_csv("data_archive/Temperature_And_Precipitation_Cities_IN/Delhi_NCR_1990_2022_Safdarjung.csv")
Lucknow = pd.read_csv("data_archive/Temperature_And_Precipitation_Cities_IN/Lucknow_1990_2022.csv")
Mumbai = pd.read_csv("data_archive/Temperature_And_Precipitation_Cities_IN/Mumbai_1990_2022_Santacruz.csv")
Rajasthan = pd.read_csv("data_archive/Temperature_And_Precipitation_Cities_IN/Rajasthan_1990_2022_Jodhpur.csv")

# Add the city name to the corresponding DataFrame
BangaloreCity['City'] = 'Bangalore'
Madras['City'] = 'Chennai'
Safdarjung['City'] = 'Delhi'
Lucknow['City'] = 'Lucknow'
Mumbai['City'] = 'Mumbai'
Rajasthan['City'] = 'Rajasthan'

# Combine all DataFrames into one and reset the index to avoid duplicate indices
weather = pd.concat([BangaloreCity, Madras, Safdarjung, Lucknow, Mumbai, Rajasthan], ignore_index=True)

# Convert the 'time' column to datetime format using the specified format
weather['time'] = pd.to_datetime(weather['time'], format='%d-%m-%Y')

# Sort the DataFrame by 'time' in ascending order
weather_sorted = weather.sort_values(by='time', ascending=True)

# Reset the index to make it continuous after sorting
weather_sorted = weather_sorted.reset_index(drop=True)

# Display the sorted DataFrame
weather_sorted


Unnamed: 0,time,tavg,tmin,tmax,prcp,City
0,1990-01-01,22.9,19.1,28.4,,Bangalore
1,1990-01-01,25.2,22.8,28.4,0.5,Chennai
2,1990-01-01,7.2,,18.1,0.0,Lucknow
3,1990-01-01,23.2,17.0,,0.0,Mumbai
4,1990-01-01,22.9,19.1,28.4,,Rajasthan
...,...,...,...,...,...,...
71359,2022-07-25,27.1,24.1,34.3,0.5,Lucknow
71360,2022-07-25,28.1,25.4,32.6,2.9,Chennai
71361,2022-07-25,24.1,20.2,28.5,0.5,Bangalore
71362,2022-07-25,28.3,25.1,30.2,7.1,Mumbai


I think we should drop dates that fall outside the period from 01/01/1990 to 20/07/2022.
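
A minimal sketch of that filter, shown on toy rows shaped like `weather_sorted`. The temperature values and the names `toy_weather`, `in_period` and `weather_period` are illustrative. It assumes both end dates should be kept.

```python
import pandas as pd

# Toy rows in the shape of weather_sorted (values are illustrative)
toy_weather = pd.DataFrame({
    "time": pd.to_datetime(["1990-01-01", "2022-07-20", "2022-07-25"]),
    "tavg": [22.9, 27.0, 27.1],
    "City": ["Bangalore", "Lucknow", "Lucknow"],
})

start = pd.Timestamp("1990-01-01")
end = pd.Timestamp("2022-07-20")

# Series.between includes both end points by default
in_period = toy_weather["time"].between(start, end)
weather_period = toy_weather[in_period].reset_index(drop=True)

print(weather_period)
```

Applied to the real data, the same mask would be built from `weather_sorted['time']`.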
