# Demographic and Behavioral Characteristics Among HIV Incident Cases Diagnosed Since 2016 – Illinois
- This table represents all new HIV diagnoses regardless of the stage of the disease [HIV (non-AIDS) or AIDS]; this is also referred to as “HIV infection” or “HIV disease.”
- The table breaks the cases down by race/ethnicity category.
- This table contains 11 columns.
  1. Cumulative Cases Diagnosed Since 2016 - Race/Ethnicity
  2. Total Cases
  3. Percent of Total Cases
  4. Case Rate
  5. Total Deaths
  6. Male Cases
  7. Percent of Male Cases
  8. Male Case Rate
  9. Female Cases
  10. Percent of Female Cases
  11. Female Case Rate
- Using Selenium for web scraping, I extract the table data and save it in CSV format.
- Perform data cleaning to remove unwanted columns.
- Perform Exploratory Data Analysis (EDA) to derive insights from the data.
- Finally, visualize the data to understand it better and support the storytelling.

In [1]:
import csv
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
plt.style.use('seaborn-v0_8-whitegrid')
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [24]:
driver = webdriver.Chrome()
driver.maximize_window()  # call the method; without parentheses nothing happens
driver.get(
    url="https://dph.illinois.gov/topics-services/diseases-and-conditions/hiv-aids/hiv-surveillance/update-reports/2023/february.html")
wait = WebDriverWait(driver, 10)
Table = wait.until(EC.presence_of_element_located((By.XPATH, "(//table[@id='DataTables_Table_12'])[1]")))

# getting the number of rows
Rows = Table.find_elements(By.XPATH, ".//tbody/tr")
Rows_Count = len(Rows)
print(f"Row Count: {Rows_Count}")

data = []
for row in Rows:
    # collect the visible text of every cell in this table row
    cells = row.find_elements(By.XPATH, ".//td")
    data.append([cell.text for cell in cells])

print()
for record in data:
    print(record)

# save the data in CSV format
with open("Hiv and Aids Demographics and Behavioral Characteristics.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    if data:
        writer.writerow(['Cumulative cases Diagnosed Since 2016', 'Total Cases', 'Percent of Total Cases',
                        'Case rate', 'Total Deaths', 'Male Cases', 'Percent of Male Cases', 'Male Case Rate', 
                         'Female Cases', 'Percent Of Female Cases', 'Female Case Rate'])
        writer.writerows(data)

print()
print('Data saved to csv!')
    
time.sleep(10)
driver.quit()

Row Count: 6

['Total', '9508', '100.00%', '10.3', '195', '7864', '100.00%', '17.36', '', '', '']
['White, non-Hispanic', '1981', '20.84%', '3.41', '63', '1726', '21.95%', '6.06', '', '', '']
['Black, non-Hispanic', '4510', '47.43%', '34.32', '98', '3456', '43.95%', '56.18', '', '', '']
['Hispanic, all races', '2233', '23.49%', '14.83', '25', '2038', '25.92%', '26.22', '', '', '']
['Other', '601', '6.32%', '10.16', '8', '489', '6.22%', '17.07', '', '', '']
['Unknown', '183', '1.92%', '', '1', '155', '1.97%', '', '', '', '']

Data saved to csv!
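
In the printed rows above, the last three fields of every row are empty strings, so the female columns never reach the CSV. Selenium's `WebElement.text` returns only the text that is rendered as visible. One plausible cause (an assumption, not verified against the page) is that those cells are hidden by the page layout, in which case reading `cell.get_attribute("textContent")` instead of `cell.text` might recover them. The sketch below is a small check that could run before writing the file; `find_empty_columns` is a hypothetical helper, and the rows are copied from the output above.

```python
# Header copied from the scraping cell; rows copied from its printed output.
HEADER = ['Cumulative cases Diagnosed Since 2016', 'Total Cases', 'Percent of Total Cases',
          'Case rate', 'Total Deaths', 'Male Cases', 'Percent of Male Cases', 'Male Case Rate',
          'Female Cases', 'Percent Of Female Cases', 'Female Case Rate']

scraped_rows = [
    ['Total', '9508', '100.00%', '10.3', '195', '7864', '100.00%', '17.36', '', '', ''],
    ['White, non-Hispanic', '1981', '20.84%', '3.41', '63', '1726', '21.95%', '6.06', '', '', ''],
    ['Black, non-Hispanic', '4510', '47.43%', '34.32', '98', '3456', '43.95%', '56.18', '', '', ''],
    ['Hispanic, all races', '2233', '23.49%', '14.83', '25', '2038', '25.92%', '26.22', '', '', ''],
    ['Other', '601', '6.32%', '10.16', '8', '489', '6.22%', '17.07', '', '', ''],
    ['Unknown', '183', '1.92%', '', '1', '155', '1.97%', '', '', '', ''],
]


def find_empty_columns(header, rows):
    """Return the header names whose cell is empty (or missing) in every row.

    Hypothetical helper for illustration; it is not part of the original notebook.
    """
    return [
        name
        for index, name in enumerate(header)
        if all(index >= len(row) or row[index].strip() == "" for row in rows)
    ]


empty_columns = find_empty_columns(HEADER, scraped_rows)
if empty_columns:
    print(f"Warning: no data scraped for columns: {empty_columns}")
```

Columns that are only partly empty, such as the case rate for the "Unknown" row, are not flagged, because the check requires every row to be empty.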


In [26]:
df = pd.read_csv('February 2023 HIV Surveillance Update Report.csv')
df

Unnamed: 0,Cumulative Cases Diagnosed Since 2016,Total Cases,Percent of Total Cases,Case Rate,Total Deaths,Male Cases,Percent of Male Cases,Male Case Rate,Female Cases,Percent of Female Cases,Female Case Rate
0,Total,9508,100.00%,10.3,195,7864,100.00%,17.36,1644,100.00%,3.5
1,"White, non-Hispanic",1981,20.84%,3.41,63,1726,21.95%,6.06,255,15.51%,0.86
2,"Black, non-Hispanic",4510,47.43%,34.32,98,3456,43.95%,56.18,1054,64.11%,15.08
3,"Hispanic, all races",2233,23.49%,14.83,25,2038,25.92%,26.22,195,11.86%,2.68
4,Other,601,6.32%,10.16,8,489,6.22%,17.07,112,6.81%,3.67
5,Unknown,183,1.92%,,1,155,1.97%,,28,1.70%,
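
The `Unnamed: 0` column in the frame above is what pandas produces when the first header field of a CSV is empty, which typically means a saved index. A minimal sketch of one way to load the data without it and to turn the percentage strings into numbers follows. It assumes the file has the layout displayed above; the CSV text is inlined through `io.StringIO` so the sketch runs without the file, and `CSV_TEXT` and `percent_columns` are names introduced only for this illustration.

```python
import io

import pandas as pd

# Inline copy of the data shown above (assumed to match the file's layout).
CSV_TEXT = """\
,Cumulative Cases Diagnosed Since 2016,Total Cases,Percent of Total Cases,Case Rate,Total Deaths,Male Cases,Percent of Male Cases,Male Case Rate,Female Cases,Percent of Female Cases,Female Case Rate
0,Total,9508,100.00%,10.3,195,7864,100.00%,17.36,1644,100.00%,3.5
1,"White, non-Hispanic",1981,20.84%,3.41,63,1726,21.95%,6.06,255,15.51%,0.86
2,"Black, non-Hispanic",4510,47.43%,34.32,98,3456,43.95%,56.18,1054,64.11%,15.08
3,"Hispanic, all races",2233,23.49%,14.83,25,2038,25.92%,26.22,195,11.86%,2.68
4,Other,601,6.32%,10.16,8,489,6.22%,17.07,112,6.81%,3.67
5,Unknown,183,1.92%,,1,155,1.97%,,28,1.70%,
"""

# index_col=0 uses the unnamed first column as the index instead of as data.
df = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

# Strip the trailing "%" and convert the percentage columns to floats.
percent_columns = [col for col in df.columns if col.startswith("Percent")]
for col in percent_columns:
    df[col] = df[col].str.rstrip("%").astype(float)

print(df.dtypes)
```

With the real file, the same call would be `pd.read_csv('February 2023 HIV Surveillance Update Report.csv', index_col=0)`.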


# Exploratory Data Analysis
- We perform EDA to identify and remove outliers in the data.
- Gain insight by examining how the data is distributed.
- Clean the data by removing duplicates and handling null values.
- Finally, visualize the data to
  1) Tell a clearer story
  2) Find correlations in the numerical data
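
The checks listed above can be sketched on a small frame as follows. This is only an illustration under stated assumptions: the values are copied from the table shown earlier, the "Total" row is left out so the aggregate does not distort the statistics, and the column label `Race/Ethnicity` and the variable names are chosen here for brevity rather than taken from the original file.

```python
import pandas as pd

# Values copied from the table shown earlier, without the "Total" row.
# "Race/Ethnicity" is a shortened label assumed for this sketch.
eda_df = pd.DataFrame({
    "Race/Ethnicity": ["White, non-Hispanic", "Black, non-Hispanic",
                       "Hispanic, all races", "Other", "Unknown"],
    "Total Cases": [1981, 4510, 2233, 601, 183],
    "Case Rate": [3.41, 34.32, 14.83, 10.16, None],  # rate missing for "Unknown"
    "Total Deaths": [63, 98, 25, 8, 1],
    "Male Cases": [1726, 3456, 2038, 489, 155],
    "Female Cases": [255, 1054, 195, 112, 28],
})

# Duplicates: count fully repeated rows.
duplicate_count = int(eda_df.duplicated().sum())

# Null values: count missing entries per column.
null_counts = eda_df.isna().sum()

# Distribution: summary statistics of the numeric columns.
summary = eda_df.describe()

# Correlation between numeric columns (missing values are skipped pairwise).
correlation = eda_df.select_dtypes(include="number").corr()

print(f"Duplicate rows: {duplicate_count}")
print(null_counts)
print(summary)
print(correlation.round(2))
```

With only five rows, the correlation values are descriptive at best and should not be read as statistically meaningful.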