# Demographic and Behavioral Characteristics Among HIV Incident Cases Diagnosed Since 2016 – Illinois
- This table covers all new HIV diagnoses regardless of the stage of the disease [HIV (non-AIDS) or AIDS]; these are also referred to as “HIV infection” or “HIV disease.”
- The table breaks the cases down by race/ethnicity.
- The table contains 11 columns:
  1. Cumulative Cases Diagnosed Since 2016 (the race/ethnicity group)
  2. Total Cases
  3. Percent of Total Cases
  4. Case Rate
  5. Total Deaths
  6. Male Cases
  7. Percent of Male Cases
  8. Male Case Rate
  9. Female Cases
  10. Percent of Female Cases
  11. Female Case Rate
- Using Selenium for web scraping, I extract the table data and save it in CSV format.
- Perform data cleaning to remove unwanted columns.
- Perform exploratory data analysis (EDA) to derive insights from the data.
- Finally, visualize the data to understand it better and support storytelling.

In [1]:
import csv
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
plt.style.use('seaborn-v0_8-whitegrid')
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [24]:
driver = webdriver.Chrome()
driver.maximize_window()  # the parentheses are required; without them the method is never called
driver.get(
    url="https://dph.illinois.gov/topics-services/diseases-and-conditions/hiv-aids/hiv-surveillance/update-reports/2023/february.html")
wait = WebDriverWait(driver, 10)
Table = wait.until(EC.presence_of_element_located((By.XPATH, "(//table[@id='DataTables_Table_12'])[1]")))

# getting the number of rows
Rows = Table.find_elements(By.XPATH, ".//tbody/tr")
Rows_Count = len(Rows)
print(f"Row Count: {Rows_Count}")

# collect the text of every cell, one list per table row
data = []
for row in Rows:
    cells = row.find_elements(By.XPATH, ".//td")
    data.append([cell.text for cell in cells])

print()
for row in data:
    print(row)

# save the data in CSV format
with open("Hiv and Aids Demographics and Behavioral Characteristics.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    if data:
        writer.writerow(['Cumulative cases Diagnosed Since 2016', 'Total Cases', 'Percent of Total Cases',
                        'Case rate', 'Total Deaths', 'Male Cases', 'Percent of Male Cases', 'Male Case Rate', 
                         'Female Cases', 'Percent Of Female Cases', 'Female Case Rate'])
        writer.writerows(data)

print()
print('Data saved to csv!')
    
time.sleep(10)
driver.quit()

Row Count: 6

['Total', '9508', '100.00%', '10.3', '195', '7864', '100.00%', '17.36', '', '', '']
['White, non-Hispanic', '1981', '20.84%', '3.41', '63', '1726', '21.95%', '6.06', '', '', '']
['Black, non-Hispanic', '4510', '47.43%', '34.32', '98', '3456', '43.95%', '56.18', '', '', '']
['Hispanic, all races', '2233', '23.49%', '14.83', '25', '2038', '25.92%', '26.22', '', '', '']
['Other', '601', '6.32%', '10.16', '8', '489', '6.22%', '17.07', '', '', '']
['Unknown', '183', '1.92%', '', '1', '155', '1.97%', '', '', '', '']

Data saved to csv!
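
The three Female columns came back as empty strings in the scraped rows. One possible cause (an assumption, not verified against the page) is that those cells are hidden by the table's responsive layout: Selenium's `.text` returns only text that is visible, while `get_attribute("textContent")` returns the text whether or not the element is rendered. A minimal sketch of a fallback, where `cell_text` and `FakeCell` are hypothetical names used only for illustration:

```python
def cell_text(cell):
    """Return a cell's visible text, falling back to its raw textContent.

    `cell` is expected to be a Selenium WebElement (or anything exposing
    `.text` and `.get_attribute`). Assumption: an empty `.text` means the
    cell is hidden rather than genuinely empty.
    """
    visible = cell.text.strip()
    if visible:
        return visible
    raw = cell.get_attribute("textContent")
    return (raw or "").strip()


class FakeCell:
    """Hypothetical stand-in for a WebElement so the sketch runs offline."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


# visible cell, hidden cell with content, genuinely empty cell
cells = [FakeCell("Total", "Total"), FakeCell("", " 1644 "), FakeCell("", "")]
row = [cell_text(c) for c in cells]
print(row)
```

In the scraping loop, `[cell_text(cell) for cell in cells]` would take the place of `[cell.text for cell in cells]`.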


In [128]:
# load the report data (this file name differs from the CSV written by the scraping cell)
df = pd.read_csv('February 2023 HIV Surveillance Update Report.csv')
dfCopy = df.copy()
dfCopy

Unnamed: 0,Cumulative Cases Diagnosed Since 2016,Total Cases,Percent of Total Cases,Case Rate,Total Deaths,Male Cases,Percent of Male Cases,Male Case Rate,Female Cases,Percent of Female Cases,Female Case Rate
0,Total,9508,100.00%,10.3,195,7864,100.00%,17.36,1644,100.00%,3.5
1,"White, non-Hispanic",1981,20.84%,3.41,63,1726,21.95%,6.06,255,15.51%,0.86
2,"Black, non-Hispanic",4510,47.43%,34.32,98,3456,43.95%,56.18,1054,64.11%,15.08
3,"Hispanic, all races",2233,23.49%,14.83,25,2038,25.92%,26.22,195,11.86%,2.68
4,Other,601,6.32%,10.16,8,489,6.22%,17.07,112,6.81%,3.67
5,Unknown,183,1.92%,,1,155,1.97%,,28,1.70%,
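
Because the first row is an aggregate, a quick consistency check is to confirm that the group rows add up to the "Total" row and that male plus female cases equal total cases. The sketch below rebuilds the count columns from the values shown above instead of reading the CSV, so it runs on its own; `counts`, `groups_sum` and `sexes_match` are illustrative names.

```python
import pandas as pd

# Count columns copied from the table above
counts = pd.DataFrame(
    {
        "Group": ["Total", "White, non-Hispanic", "Black, non-Hispanic",
                  "Hispanic, all races", "Other", "Unknown"],
        "Total Cases": [9508, 1981, 4510, 2233, 601, 183],
        "Total Deaths": [195, 63, 98, 25, 8, 1],
        "Male Cases": [7864, 1726, 3456, 2038, 489, 155],
        "Female Cases": [1644, 255, 1054, 195, 112, 28],
    }
).set_index("Group")

# Sum the five group rows and compare with the aggregate row
groups_sum = counts.drop(index="Total").sum()
matches = groups_sum == counts.loc["Total"]
print(matches)

# Male + Female should equal Total Cases on every row
sexes_match = (counts["Male Cases"] + counts["Female Cases"] == counts["Total Cases"]).all()
print(sexes_match)
```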


# Exploratory Data Analysis
- We perform EDA to identify and handle outliers in the data.
- Gain insight by examining how the data is distributed.
- Clean the data: remove duplicates and handle null values.
- Lastly, visualize the data to
  1) Tell a clearer story
  2) Find correlations among the numerical columns

In [129]:
# number of rows and columns
rows, columns = dfCopy.shape
print(f"Rows: {rows}, Columns: {columns}")

Rows: 6, Columns: 11


In [130]:
dfCopy.describe()

Unnamed: 0,Total Cases,Case Rate,Total Deaths,Male Cases,Male Case Rate,Female Cases,Female Case Rate
count,6.0,5.0,6.0,6.0,5.0,6.0,5.0
mean,3169.333333,14.604,65.0,2621.333333,24.578,548.0,5.158
std,3457.729351,11.751074,73.400272,2826.792081,19.0576,652.411833,5.657351
min,183.0,3.41,1.0,155.0,6.06,28.0,0.86
25%,946.0,10.16,12.25,798.25,17.07,132.75,2.68
50%,2107.0,10.3,44.0,1882.0,17.36,225.0,3.5
75%,3940.75,14.83,89.25,3101.5,26.22,854.25,3.67
max,9508.0,34.32,195.0,7864.0,56.18,1644.0,15.08
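
The summary above includes the "Total" row, which is the sum of the other rows, so the mean and maximum of the count columns are inflated by it. A minimal sketch of the same summary with the aggregate row excluded, using the Total Cases values from the table (`cases` and `groups_only` are illustrative names):

```python
import pandas as pd

cases = pd.Series(
    [9508, 1981, 4510, 2233, 601, 183],
    index=["Total", "White, non-Hispanic", "Black, non-Hispanic",
           "Hispanic, all races", "Other", "Unknown"],
    name="Total Cases",
)

# Drop the aggregate row before summarising
groups_only = cases.drop("Total")

print(cases.mean())        # mean with the aggregate row included
print(groups_only.mean())  # mean over the five groups only
print(groups_only.describe())
```

On the notebook's DataFrame, the equivalent filter would be `dfCopy[dfCopy['Cumulative Cases Diagnosed Since 2016'] != 'Total'].describe()`.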


In [134]:
dfCopy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 11 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Cumulative Cases Diagnosed Since 2016  6 non-null      object 
 1   Total Cases                            6 non-null      int64  
 2   Percent of Total Cases                 6 non-null      float64
 3   Case Rate                              5 non-null      float64
 4   Total Deaths                           6 non-null      int64  
 5   Male Cases                             6 non-null      int64  
 6   Percent of Male Cases                  6 non-null      float64
 7   Male Case Rate                         5 non-null      float64
 8   Female Cases                           6 non-null      int64  
 9   Percent of Female Cases                6 non-null      float64
 10  Female Case Rate                       5 non-null      float64
dtypes: float64

In [132]:
# count the null values in each column
dfCopy.isna().sum()

Cumulative Cases Diagnosed Since 2016    0
Total Cases                              0
Percent of Total Cases                   0
Case Rate                                1
Total Deaths                             0
Male Cases                               0
Percent of Male Cases                    0
Male Case Rate                           1
Female Cases                             0
Percent of Female Cases                  0
Female Case Rate                         1
dtype: int64

In [133]:
# convert percent strings such as "20.84%" to floats (safe to re-run on columns that are already numeric)
for col in ['Percent of Total Cases', 'Percent of Male Cases', 'Percent of Female Cases']:
    dfCopy[col] = dfCopy[col].astype(str).str.rstrip('%').astype(float)
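
The `info()` output shown earlier already lists the percent columns as `float64`, which suggests the cells were run out of order, so it helps if the conversion can be re-run safely. A small sketch of a helper that accepts either percent strings or values that are already numeric; `percent_to_float` is a hypothetical name, not something defined elsewhere in the notebook:

```python
import pandas as pd


def percent_to_float(series):
    """Strip a trailing '%' and convert to float; numeric input passes through."""
    return pd.to_numeric(series.astype(str).str.rstrip("%"))


raw = pd.Series(["100.00%", "20.84%", "47.43%"])
already_numeric = pd.Series([100.0, 20.84, 47.43])

converted_raw = percent_to_float(raw)
converted_numeric = percent_to_float(already_numeric)
print(converted_raw.tolist())
print(converted_numeric.tolist())
```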

In [144]:
# fill the null rates; forward fill copies the previous row's value, so "Unknown" inherits the "Other" rates
for col in ['Case Rate', 'Male Case Rate', 'Female Case Rate']:
    dfCopy[col] = dfCopy[col].ffill()
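
Forward fill copies the value from the row above, so the "Unknown" group ends up with the "Other" group's rate. If the rate is missing because no rate can be computed for that group (an assumption about the source table), leaving it as `NaN` may be the more honest choice, since pandas skips `NaN` in aggregations by default. A sketch comparing the two on the Female Case Rate values, with the "Total" row left out:

```python
import numpy as np
import pandas as pd

female_rate = pd.Series(
    [0.86, 15.08, 2.68, 3.67, np.nan],
    index=["White, non-Hispanic", "Black, non-Hispanic",
           "Hispanic, all races", "Other", "Unknown"],
    name="Female Case Rate",
)

filled = female_rate.ffill()

print(filled["Unknown"])    # the "Other" value copied down
print(female_rate.mean())   # NaN is skipped: mean over four groups
print(filled.mean())        # imputed value counted: mean over five values
```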

In [151]:
# rename the columns (each key must match an existing column name exactly)
dfCopy = dfCopy.rename(
    columns={"Cumulative Cases Diagnosed Since 2016": "Cumulative Cases Diagnosed in 2016",
             "Percent of Total Cases": "Total Cases %", "Percent of Male Cases": "Male Cases %",
             "Percent of Female Cases": "Female Cases %"})

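`DataFrame.rename` ignores keys that do not match an existing column, so a typo in a key leaves that column unrenamed without any error. Passing `errors="raise"` makes the mismatch visible. A short sketch on a one-row frame, where `"Group"` is just an illustrative new name:

```python
import pandas as pd

df = pd.DataFrame({"Cumulative Cases Diagnosed Since 2016": ["Total"],
                   "Total Cases": [9508]})

# The key below is missing the word "Since", so nothing is renamed
unchanged = df.rename(columns={"Cumulative Cases Diagnosed 2016": "Group"})
print(list(unchanged.columns))

# errors="raise" turns the same typo into a KeyError
try:
    df.rename(columns={"Cumulative Cases Diagnosed 2016": "Group"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```
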
In [152]:
dfCopy

Unnamed: 0,Cumulative Cases Diagnosed in 2016,Total Cases,Total Cases %,Case Rate,Total Deaths,Male Cases,Male Cases %,Male Case Rate,Female Cases,Female Cases %,Female Case Rate
0,Total,9508,100.0,10.3,195,7864,100.0,17.36,1644,100.0,3.5
1,"White, non-Hispanic",1981,20.84,3.41,63,1726,21.95,6.06,255,15.51,0.86
2,"Black, non-Hispanic",4510,47.43,34.32,98,3456,43.95,56.18,1054,64.11,15.08
3,"Hispanic, all races",2233,23.49,14.83,25,2038,25.92,26.22,195,11.86,2.68
4,Other,601,6.32,10.16,8,489,6.22,17.07,112,6.81,3.67
5,Unknown,183,1.92,10.16,1,155,1.97,17.07,28,1.7,3.67
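
One way to read the case rates is as ratios against a reference group. The sketch below divides each group's Case Rate by the "White, non-Hispanic" rate; the choice of reference group is only illustrative, and "Unknown" is left out because its rate is missing in the source and the value shown above is imputed.

```python
import pandas as pd

# Case Rate values from the table above, excluding "Total" and "Unknown"
case_rate = pd.Series(
    {"White, non-Hispanic": 3.41, "Black, non-Hispanic": 34.32,
     "Hispanic, all races": 14.83, "Other": 10.16},
    name="Case Rate",
)

# Ratio of each group's rate to the reference group's rate
rate_ratio = (case_rate / case_rate["White, non-Hispanic"]).round(2)
print(rate_ratio.sort_values(ascending=False))
```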


In [136]:
dfCopy.to