
# Final Project

This is the final project of Kiara Martin for the Peckham Digital Accelerator Zone: Introduction to Data Science course. 

I will compare UK sunshine and rainfall data from the Met Office (https://www.metoffice.gov.uk/research/climate/maps-and-data/uk-and-regional-series) against UK data on diagnosed sexually transmitted infections (https://www.gov.uk/government/statistics/sexually-transmitted-infections-stis-annual-data-tables) to identify a possible correlation. If a correlation is identified, it could be useful to UK public health authorities in combating the increase in sexually transmitted infections, as it could allow public health promotions to be targeted in line with weather predictions.

Factors that could interfere with this data include the COVID-19 pandemic in 2020 and 2021. The data source describes the figures for those years as 'notably lower than previous years due to the reconfiguration of sexual health services during the national response to the COVID-19 pandemic.'

The source notes that the 'Data presented in relation to gender identity may, or may not, be the same as sex registered at birth. The data total may include people who are gender diverse or those reported with an unknown gender. Therefore the sum of data for men and women may not equal the data total'. To account for this, I will use only the rows where the gender identity is 'Total', not the rows labelled 'Women' and 'Men'.

I have narrowed my research focus from sexually transmitted infections in general to chlamydia diagnoses by gender identity and age group in people aged 15 to 24 years in England between 2015 and 2024.
 
I will narrow the UK to just England so that the region covered by the sunshine and rainfall data matches the region covered by the chlamydia dataset (diagnoses by gender identity and age group in people aged 15 to 24 years in England, 2015 to 2024). It would be possible to narrow further by region; however, it is not clear what the exact geographical boundaries of the regions are in each dataset. For example, the boundaries of the Midlands and the East of England could differ between the two.
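
To make "identify a possible correlation" concrete, one option is the Pearson correlation coefficient between two annual series. The sketch below only illustrates the mechanics and is not the analysis itself. It reuses the 2021 to 2024 sunshine totals and the 'Tests', 'Total', '15 to 19' row shown in the data previews later in this notebook (tests, not diagnoses). The names `sample` and `r` are illustrative. Four data points are far too few to support any conclusion.

```python
import pandas as pd

# Illustrative values copied from the data previews later in this notebook.
# Note: these are test counts, not diagnoses, and cover only four years.
sample = pd.DataFrame({
    'Year': [2021, 2022, 2023, 2024],
    'Annual Sunshine Hours': [1486.9, 1741.4, 1559.3, 1394.6],
    'Chlamydia tests (Total, 15 to 19)': [255650, 267251, 265638, 233780],
})

# Series.corr uses the Pearson method by default; the result lies between -1 and 1.
r = sample['Annual Sunshine Hours'].corr(sample['Chlamydia tests (Total, 15 to 19)'])
print(f"Pearson r = {r:.3f}")
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates a weak one. A coefficient alone does not show that weather causes any change in infections.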


## Imports

Importing libraries and data files.
Double-checking the paths to the data files.

In [1]:
import pandas as pd # Data manipulation library
import matplotlib.pyplot as plt # Plotting library

In [2]:
sunshine_data_path = 'Raw_data/Sunshine_England.csv' # Path to the sunshine data CSV file
sunshine_df = pd.read_csv(sunshine_data_path) # Load the sunshine data into a DataFrame
sunshine_df # Display the sunshine data (Jupyter shows a truncated preview of the first and last rows)

Unnamed: 0,year,ann
0,1910,1373.5
1,1911,1662.3
2,1912,1202.6
3,1913,1254.5
4,1914,1522.3
...,...,...
111,2021,1486.9
112,2022,1741.4
113,2023,1559.3
114,2024,1394.6


In [3]:
rainfall_data_path = 'Raw_data/Rainfall_England.csv' # Path to the rainfall data CSV file
rainfall_df = pd.read_csv(rainfall_data_path) # Load the rainfall data into a DataFrame
rainfall_df # Display the rainfall data (Jupyter shows a truncated preview of the first and last rows)


Unnamed: 0,year,ann
0,1836,885.2
1,1837,746.5
2,1838,757.1
3,1839,918.1
4,1840,688.7
...,...,...
185,2021,877.3
186,2022,778.8
187,2023,1066.2
188,2024,1020.1


In [4]:
chlamydia_data_path = 'Raw_data/Chlamydia.csv' # Path to the chlamydia data CSV file
chlamydia_df = pd.read_csv(chlamydia_data_path, thousands=',') # Load the chlamydia data into a DataFrame
chlamydia_df # Display the chlamydia data (Jupyter shows a truncated preview of the first and last rows)

Unnamed: 0,Area of residence,Tests or diagnoses,Gender identity\n[note 5],Age,2015,2016,2017,2018,2019,2020 \n[note 10],2021\n[note 10],2022,2023,2024,Percentage change\n2023 to 2024,Unnamed: 15,Unnamed: 16
0,England,Tests,Women,15 to 19,415359.0,365619.0,328458.0,312368.0,311493.0,198308,192416,194795.0,193903.0,170331.0,-12.2,,
1,England,Tests,Women,20 to 24,669788.0,632606.0,599518.0,613386.0,633243.0,478495,500036,495316.0,479199.0,430964.0,-10.1,,
2,England,Tests,Men,15 to 19,147030.0,123265.0,107790.0,103493.0,103960.0,60001,56300,61138.0,61559.0,54074.0,-12.2,,
3,England,Tests,Men,20 to 24,299623.0,280108.0,263885.0,273243.0,279066.0,195062,199774,215446.0,213612.0,190062.0,-11.0,,
4,England,Tests,Total,15 to 19,569147.0,497193.0,442535.0,421986.0,422474.0,262430,255650,267251.0,265638.0,233780.0,-12.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,,,,,,,,,,,,,,,,,
150,,,,,,,,,,,,,,,,,
151,,,,,,,,,,,,,,,,,
152,,,,,,,,,,,,,,,,,
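
The preview above already illustrates the point made in the introduction, that the 'Women' and 'Men' rows do not add up to the 'Total' row. A small check, using only rows 0, 2 and 4 of the preview (England, Tests, age 15 to 19) and the 2022 to 2024 columns; the names `preview` and `gap` are illustrative:

```python
import pandas as pd

# Rows 0, 2 and 4 of the preview above (England, Tests, age 15 to 19).
preview = pd.DataFrame(
    {
        '2022': [194795.0, 61138.0, 267251.0],
        '2023': [193903.0, 61559.0, 265638.0],
        '2024': [170331.0, 54074.0, 233780.0],
    },
    index=['Women', 'Men', 'Total'],
)

# Difference between the published total and the sum of the two gender rows.
gap = preview.loc['Total'] - (preview.loc['Women'] + preview.loc['Men'])
print(gap)
```

The gap is positive in each of these years, which is consistent with the total including people who are gender diverse or whose gender is unknown.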


In [5]:
# Check the current working directory, which the relative 'Raw_data/...' paths are resolved against
import os
print(os.getcwd())

/Users/kiaramartin/Documents/GitHub/daz-intro-to-ds/Final_Project
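
Printing the working directory shows where relative paths start from, but it does not confirm that the files are there. A minimal sketch of a more direct check, assuming the three relative paths used in the cells above; the helper `missing_files` is hypothetical and not part of any library:

```python
from pathlib import Path

def missing_files(paths, base_dir='.'):
    """Return the paths (relative to base_dir) that do not exist as files."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]

# The relative paths used when loading the data above.
data_files = [
    'Raw_data/Sunshine_England.csv',
    'Raw_data/Rainfall_England.csv',
    'Raw_data/Chlamydia.csv',
]
print('Missing files:', missing_files(data_files))
```

An empty list means every file was found relative to the current working directory.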



## Reading and Identifying data types

In [None]:
sunshine_df.info() # Print column data types and non-null counts for the Sunshine DataFrame
print(sunshine_df.describe()) # Print summary statistics; without print, only the last expression in a cell is displayed
sunshine_df.head() # Display the first few rows of the Sunshine DataFrame

In [None]:
rainfall_df.info() # Print column data types and non-null counts for the Rainfall DataFrame
print(rainfall_df.describe()) # Print summary statistics; without print, only the last expression in a cell is displayed
rainfall_df.head() # Display the first few rows of the Rainfall DataFrame

In [None]:
chlamydia_df.info() # Print column data types and non-null counts for the Chlamydia DataFrame
print(chlamydia_df.describe()) # Print summary statistics; without print, only the last expression in a cell is displayed
chlamydia_df.head() # Display the first few rows of the Chlamydia DataFrame


## Data Cleaning of Sunshine Data

I will identify and rename the headings for clarity. Then I will convert the columns to be the correct type. I will ensure the 'Year' column is displayed correctly. I will drop rows with missing values followed by rows outside the required timeframe of 2015 - 2024. 


In [None]:
print(sunshine_df.columns)

In [None]:
sunshine_df.rename(columns={'year  ': 'Year', '   ann': 'Annual Sunshine Hours'}, inplace=True) # Rename columns for clarity
sunshine_df.head()  # Display the first few rows of the updated Sunshine DataFrame
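
The rename above relies on the exact padding in the raw headers ('year  ' and '   ann'). By default, `rename` ignores keys that do not match any column, so a header with different padding would be left unchanged without an error. A minimal sketch of a more tolerant approach, which strips the whitespace first; the stand-in frame `raw` and the name `renamed` are illustrative:

```python
import pandas as pd

# Stand-in frame with the padded headers that the rename call above expects.
raw = pd.DataFrame({'year  ': [2023, 2024], '   ann': [1559.3, 1394.6]})

# Strip leading and trailing whitespace from every header, then rename by the clean names.
raw.columns = raw.columns.str.strip()
renamed = raw.rename(columns={'year': 'Year', 'ann': 'Annual Sunshine Hours'})
print(renamed.columns.tolist())
```

The same two steps would apply to the rainfall data, with 'Annual Rainfall (mm)' as the new name.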

In [None]:
sunshine_df['Year'] = pd.to_datetime(sunshine_df['Year'], format='%Y') # Convert 'Year' column to datetime format
sunshine_df['Annual Sunshine Hours'] = sunshine_df['Annual Sunshine Hours'].astype(float) # Convert 'Annual Sunshine Hours' to float and assign the result back to the column
sunshine_df.info() # Display data types and non-null counts for the updated Sunshine DataFrame

In [None]:
sunshine_df.dropna(inplace=True) # Drop rows with missing values in the Sunshine DataFrame

# Keep only rows where the year is between 2015 and 2024 (inclusive)
sunshine_df = sunshine_df[(sunshine_df['Year'].dt.year >= 2015) & (sunshine_df['Year'].dt.year <= 2024)].copy() # Filter to the specified year range; copy() so later column assignments act on an independent DataFrame
sunshine_df # Display the filtered Sunshine DataFrame
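
An equivalent year filter can be written with `Series.between`, which includes both endpoints by default. This sketch assumes 'Year' is kept as a plain integer rather than converted to datetime. The stand-in frame `df` uses rows from the sunshine preview, and the name `recent` is illustrative:

```python
import pandas as pd

# Stand-in frame using rows from the sunshine preview, with Year kept as an integer.
df = pd.DataFrame({
    'Year': [1910, 1911, 2021, 2022, 2023, 2024],
    'Annual Sunshine Hours': [1373.5, 1662.3, 1486.9, 1741.4, 1559.3, 1394.6],
})

# between() includes both endpoints by default; copy() makes the result independent of df.
recent = df[df['Year'].between(2015, 2024)].copy()
print(recent)
```

Taking a copy avoids the SettingWithCopyWarning that some pandas versions raise when a column is later assigned on a filtered frame.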

In [None]:
# Need to convert the 'Year' column to display the year only

sunshine_df['Year'] = sunshine_df['Year'].dt.year # Convert 'Year' column to display only the year


In [None]:
sunshine_df # Display the first few rows of the final Sunshine DataFrame after cleaning and formatting


### Reading and Identifying Sunshine Data Type to confirm cleaning is complete

In [None]:
sunshine_df.info() # Print column data types and non-null counts for the final Sunshine DataFrame
print(sunshine_df.describe()) # Print summary statistics; without print, only the last expression in a cell is displayed
sunshine_df.head() # Display the first few rows of the final Sunshine DataFrame after cleaning and formatting

In [None]:
# Ensure 'Annual Sunshine Hours' is of type float by assigning the converted column back to the DataFrame

sunshine_df['Annual Sunshine Hours'] = sunshine_df['Annual Sunshine Hours'].astype(float) # Convert 'Annual Sunshine Hours' to float type
sunshine_df.info() # Display information about the final Sunshine DataFrame after ensuring correct data types

In [None]:
# Data Cleaning and Formatting for Sunshine Dataset is Complete


## Data Cleaning of Rainfall Data

I will identify and rename the headings for clarity. Then I will convert the columns to be the correct type. I will ensure the 'Year' column is displayed correctly. I will drop rows with missing values followed by rows outside the required timeframe of 2015 - 2024. 

In [None]:
print(rainfall_df.columns) # Display the columns of the Rainfall DataFrame

In [None]:
rainfall_df.rename(columns={'year  ': 'Year', '   ann': 'Annual Rainfall (mm)'}, inplace=True) # Rename columns for clarity
rainfall_df.head()  # Display the first few rows of the updated Rainfall DataFrame

In [None]:
rainfall_df['Year'] = pd.to_datetime(rainfall_df['Year'], format='%Y') # Convert 'Year' column to datetime format
rainfall_df['Year'] = rainfall_df['Year'].dt.year # Convert 'Year' column to display only the year
rainfall_df # Display the first few rows of the updated Rainfall DataFrame

In [None]:
rainfall_df['Annual Rainfall (mm)'] = rainfall_df['Annual Rainfall (mm)'].astype(float) # Ensure 'Annual Rainfall (mm)' is of type float
rainfall_df.info() # Display information about the updated Rainfall DataFrame, including data types and non-null counts

In [None]:
rainfall_df.dropna(inplace=True) # Drop rows with missing values in the Rainfall DataFrame
rainfall_df.info() # Display information about the Rainfall DataFrame after dropping rows with missing values
rainfall_df
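
`dropna` only removes values that pandas recognises as missing (NaN or None). A blank that was read in as text, such as an empty or whitespace-only string, is not treated as missing. Whether that applies to this file is not something I have verified. A quick way to see what is still missing is to count missing values per column and list the affected rows. In the stand-in frame below, the 2023 and 2024 values come from the rainfall preview and the empty 2025 row is an assumption:

```python
import numpy as np
import pandas as pd

# Stand-in frame: 2023 and 2024 come from the preview; the empty 2025 row is assumed.
df = pd.DataFrame({
    'Year': [2023, 2024, 2025],
    'Annual Rainfall (mm)': [1066.2, 1020.1, np.nan],
})

print(df.isna().sum())            # Missing values per column
print(df[df.isna().any(axis=1)])  # The rows that contain them

cleaned = df.dropna()              # Returns a new frame without those rows
print(cleaned.isna().sum().sum())  # Total missing values left after dropna
```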

In [None]:
# 2025 has an empty value but has not been dropped? It is not required, and this row will be dropped by the year filter below

rainfall_df = rainfall_df[(rainfall_df['Year'] >= 2015) & (rainfall_df['Year'] <= 2024)] # Filter the Rainfall DataFrame for the specified year range
rainfall_df # Display the Rainfall DataFrame after filtering for the specified year range

In [None]:
# 2025 has now been dropped successfully

### Reading and Identifying Rainfall Data Type to confirm cleaning is complete


In [None]:
rainfall_df.info() # Print column data types and non-null counts for the final Rainfall DataFrame
print(rainfall_df.describe()) # Print summary statistics; without print, only the last expression in a cell is displayed
rainfall_df.head() # Display the first few rows of the final Rainfall DataFrame after cleaning and formatting

In [None]:
# Data Cleaning and Formatting for Rainfall Dataset is Complete


## Data Cleaning of Chlamydia Data

I will identify and rename the headings for clarity. I will drop rows with missing values followed by rows that are not in England, not Diagnoses, not Total gender identity or not within the 15 - 24 age range. Then I will convert the columns to be the correct type.

In [None]:
print(chlamydia_df.columns) 
chlamydia_df.head() # Display the first few rows of the Chlamydia DataFrame

In [None]:
chlamydia_df.drop(columns=['Unnamed: 15', 'Percentage change\n2023 to 2024', 'Unnamed: 16'], inplace=True) # Drop unnecessary columns
print(chlamydia_df.columns) # Display the updated column names after dropping unnecessary columns

In [None]:
chlamydia_df.rename(columns={'2020 \n[note 10]': '2020', '2021\n[note 10]': '2021', 'Gender identity\n[note 5]':'Gender identity'}, inplace=True) # Rename columns for clarity
chlamydia_df.head() # Display the first few rows of the updated Chlamydia DataFrame
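
The row filter described at the start of this section (England only, one measure, 'Total' gender identity) could be sketched as below. The stand-in frame `sample` copies rows 0 to 4 of the preview, keeping only the 2022 to 2024 columns. The helper `select_rows` is hypothetical. The preview only shows rows labelled 'Tests', so the sketch filters on that label. The exact label used for diagnoses rows in the file is an assumption that would need checking, for example with `chlamydia_df['Tests or diagnoses'].unique()`.

```python
import pandas as pd

# Stand-in rows copied from the preview (2022 to 2024 columns only).
sample = pd.DataFrame({
    'Area of residence': ['England'] * 5,
    'Tests or diagnoses': ['Tests'] * 5,
    'Gender identity': ['Women', 'Women', 'Men', 'Men', 'Total'],
    'Age': ['15 to 19', '20 to 24', '15 to 19', '20 to 24', '15 to 19'],
    '2022': [194795.0, 495316.0, 61138.0, 215446.0, 267251.0],
    '2023': [193903.0, 479199.0, 61559.0, 213612.0, 265638.0],
    '2024': [170331.0, 430964.0, 54074.0, 190062.0, 233780.0],
})

def select_rows(df, measure, area='England', gender='Total'):
    """Keep rows matching the given area, measure and gender identity."""
    mask = (
        (df['Area of residence'] == area)
        & (df['Tests or diagnoses'] == measure)
        & (df['Gender identity'] == gender)
    )
    return df[mask].copy()

# 'Tests' is the only label visible in the preview; the diagnoses label is not shown there.
selected = select_rows(sample, measure='Tests')
print(selected)
```

Combining the conditions with `&` keeps only rows that satisfy all of them, and `copy()` returns a frame that can be modified without affecting the original.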