**Analyzing U.S. COVID-19 Data**

**Imports:**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
import scipy.stats as stats
from google.colab import drive
import statsmodels.api as sm
from scipy.stats import f_oneway
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import classification_report, accuracy_score


**Loading the data:**

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
data1_path = '/content/drive/Shareddrives/team6/data.csv'

df1 = pd.read_csv(data1_path)




In [None]:
df1_cleaned = df1.drop_duplicates()

**PART 3: Hypothesis Testing:**

**Claim: “There is a strong association between probability of death due to COVID-19 and patient demographics”**

  **3.1 Formulate a hypothesis test to assess the validity of this claim given**
the available data:

● State the test you will use and justify your choice.

● Clearly state the hypotheses.

● Conduct the test and report the result.

● Make a conclusion as to the validity of the claim, assume a significance level
of 0.05.

We use the chi-square test of independence, which is designed to assess whether there is a significant association between two categorical variables. Here the two categorical variables are the combined demographic group (age group, sex, and race) and the death outcome.

Null Hypothesis (H₀): There is no association between the probability of death due to COVID-19 and patient demographics

Alternative Hypothesis (H₁): There is an association between the probability of death due to COVID-19 and patient demographics

In [None]:
df = df1_cleaned[['age_group', 'sex', 'race', 'death_yn']].copy()
df = df[~df['age_group'].isin(['Missing', 'Unknown'])]
df = df[~df['sex'].isin(['Missing', 'Unknown'])]
df = df[~df['race'].isin(['Missing', 'Unknown'])]
df = df[~df['death_yn'].isin(['Missing', 'Unknown'])]
df.dropna(inplace=True)
df['demographics'] = df[['age_group', 'sex', 'race']].astype(str).agg('_'.join, axis=1)
df

Unnamed: 0,age_group,sex,race,death_yn,demographics
4,65+ years,Female,White,No,65+ years_Female_White
10,18 to 49 years,Female,Black,No,18 to 49 years_Female_Black
12,18 to 49 years,Female,White,No,18 to 49 years_Female_White
13,50 to 64 years,Female,White,No,50 to 64 years_Female_White
16,0 - 17 years,Male,White,No,0 - 17 years_Male_White
...,...,...,...,...,...
19020868,65+ years,Female,White,No,65+ years_Female_White
19020903,18 to 49 years,Female,White,No,18 to 49 years_Female_White
19020907,50 to 64 years,Male,White,No,50 to 64 years_Male_White
19020912,18 to 49 years,Male,White,No,18 to 49 years_Male_White


In [None]:
df_copy = df[['demographics', 'death_yn']].copy()

# Create a contingency table
contingency_table = pd.crosstab(df_copy['demographics'], df_copy['death_yn'])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Print the results
print("Chi-square statistic:", chi2)
print("p-value:", p)


Chi-square statistic: 126131.64445956664
p-value: 0.0


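The p-value is below the 0.05 significance level (it prints as 0.0 because it is too small to be represented in floating point), so we reject the null hypothesis (H₀): there is an association between the probability of death due to COVID-19 and patient demographics. Note that the test shows that an association exists; it does not by itself measure how strong that association is.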
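Because the claim speaks of a strong association, an effect-size measure can complement the p-value. The following is a minimal sketch of Cramér's V on a small made-up contingency table; the helper name `cramers_v` and the counts are illustrative assumptions and are not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Cramér's V for an r x c contingency table of counts (hypothetical helper)."""
    observed = np.asarray(table, dtype=float)
    # correction=False disables the Yates continuity correction that SciPy
    # would otherwise apply to 2x2 tables.
    chi2, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Made-up counts: rows are demographic groups, columns are death outcome (No, Yes).
toy_table = [[900, 100],
             [700, 300],
             [950, 50]]
print("Cramér's V:", round(cramers_v(toy_table), 3))
```

Values near 0 indicate a weak association and values near 1 a strong one. Passing the notebook's own `contingency_table` to the same function would give the effect size for the actual data.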

In [None]:
contingency_table

demographics,death_yn=No,death_yn=Yes
0 - 17 years_Female_American Indian/Alaska Native,1013,0
0 - 17 years_Female_Asian,4186,0
0 - 17 years_Female_Black,20905,0
0 - 17 years_Female_Multiple/Other,2775,0
0 - 17 years_Female_Native Hawaiian/Other Pacific Islander,43,0
0 - 17 years_Female_White,89987,0
0 - 17 years_Male_American Indian/Alaska Native,1048,0
0 - 17 years_Male_Asian,4588,0
0 - 17 years_Male_Black,21044,0
0 - 17 years_Male_Multiple/Other,3783,0







 **3.2 Come up with your own claim from the available data and conduct
a hypothesis test for it following the same steps.**

Claim: Patients with a reported exposure are more likely to die than patients without one.





**● State the test you will use and justify your choice:**

We use the two-sample z-test for proportions. This test assesses whether the proportion of a binary outcome differs significantly between two independent groups. Here the binary outcome is death, and the two groups are patients with exposure and patients without exposure. Because the claim is directional (patients with exposure are more likely to die), we run the test as a one-sided test.
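To make the statistic concrete, here is a minimal hand-computed sketch of a one-sided two-proportion z-test with a pooled proportion. The function name `two_proportion_ztest_larger` and the counts are made up for illustration; they are not taken from the dataset.

```python
import math


def two_proportion_ztest_larger(deaths_a, n_a, deaths_b, n_b):
    """One-sided two-proportion z-test (H1: p_a > p_b) using the pooled proportion."""
    p_a = deaths_a / n_a
    p_b = deaths_b / n_b
    # Pooled proportion under the null hypothesis that both groups share one rate.
    pooled = (deaths_a + deaths_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up counts, for illustration only.
z, p = two_proportion_ztest_larger(30, 100, 20, 100)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

A negative z-score means the first group's observed proportion is lower than the second group's, in which case the one-sided p-value exceeds 0.5.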

**● Clearly state the hypotheses.**

Null Hypothesis (H_0):

The proportion of deaths among patients with exposure is less than or equal to the proportion of deaths among patients without exposure.

Alternative Hypothesis (H_1):

The proportion of deaths among patients with exposure is greater than the proportion of deaths among patients without exposure.

**● Conduct the test and report the result.**

In [None]:
# Keep only cases with a known death outcome
data = df1_cleaned[df1_cleaned['death_yn'].isin(['Yes', 'No'])]
df = data[['exposure_yn', 'death_yn']].copy()


In [None]:
# Filter patients based on exposure status.
# Grouping used here: 'Yes' and 'Unknown' count as exposed; 'Missing' counts as not exposed.
patients_with_exposure = df[df['exposure_yn'].isin(['Yes', 'Unknown'])]
patients_without_exposure = df[df['exposure_yn'] == 'Missing']

# Further filter patients based on death status
patients_with_exposure_and_death = patients_with_exposure[patients_with_exposure['death_yn'] == 'Yes']
patients_without_exposure_and_death = patients_without_exposure[patients_without_exposure['death_yn'] == 'Yes']

# Calculate the number of patients in each group
num_with_exposure = len(patients_with_exposure)
num_without_exposure = len(patients_without_exposure)

# Calculate the number of deaths in each group
num_deaths_with_exposure = len(patients_with_exposure_and_death)
num_deaths_without_exposure = len(patients_without_exposure_and_death)

# Perform the z-test for proportions
z_score, p_value = sm.stats.proportions_ztest(
    [num_deaths_with_exposure, num_deaths_without_exposure],
    [num_with_exposure, num_without_exposure],
    alternative='larger'
)

# Output the results
z_score, p_value


(-70.24192843956261, 1.0)

**● Make a conclusion as to the validity of the claim, assume a significance level
of 0.05.**

The p-value (1.0) is greater than the 0.05 significance level, so we fail to reject the null hypothesis. The negative z-score (about -70.2) shows that the observed proportion of deaths was lower in the exposure group than in the comparison group, so the data do not support the claim that patients with exposure are more likely to die.

**Bonus:**

 **Train a machine/deep learning classifier to predict the likelihood of death due to
COVID-19 using any/all of the relevant attributes in the COVID case surveillance
dataset.**

In [None]:
df = df1_cleaned[['age_group', 'sex', 'race', 'death_yn']].copy()
df = df[~df['age_group'].isin(['Missing', 'Unknown'])]
df = df[~df['sex'].isin(['Missing', 'Unknown'])]
df = df[~df['race'].isin(['Missing', 'Unknown'])]
df = df[~df['death_yn'].isin(['Missing', 'Unknown'])]
df.dropna(inplace=True)

# One-Hot Encoding for categorical features
# (scikit-learn >= 1.2 names this argument sparse_output; older releases used sparse)
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_features = encoder.fit_transform(df[['age_group', 'sex', 'race']])

# Label Encoding for target variable
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(df['death_yn'])

# Combine encoded features with labels
X = encoded_features
y = encoded_labels

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the neural network model
model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
y_pred = (model.predict(X_test) > 0.5).astype("int32")

# Print evaluation metrics
print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))




Epoch 1/50
...
Epoch 50/50




              precision    recall  f1-score   support

           0       0.79      0.91      0.85    349394
           1       0.01      0.07      0.02      9516
           2       0.00      0.00      0.00     97248

    accuracy                           0.70    456158
   macro avg       0.27      0.33      0.29    456158
weighted avg       0.61      0.70      0.65    456158

Accuracy: 0.6994221300514295
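The report above lists three label classes (0, 1, 2), whereas a single sigmoid output can only distinguish two. The following is a minimal sketch, assuming the intended target is binary with only 'Yes' and 'No' values, of an explicit encoding that raises an error when any other label is present. The helper name `encode_binary_target` is hypothetical and the sample values are made up.

```python
import pandas as pd


def encode_binary_target(labels: pd.Series) -> pd.Series:
    """Map 'No' -> 0 and 'Yes' -> 1, raising if any other value (or NaN) appears."""
    mapping = {"No": 0, "Yes": 1}
    unexpected = set(labels.dropna().unique()) - set(mapping)
    if unexpected or labels.isna().any():
        raise ValueError(
            f"Target must contain only 'Yes'/'No'; found extra values: {sorted(unexpected)}"
        )
    return labels.map(mapping).astype("int64")


# Made-up labels, for illustration only.
sample = pd.Series(["No", "Yes", "No"])
encoded = encode_binary_target(sample)
print(encoded.tolist())
```

Unlike `LabelEncoder`, which silently assigns an integer to every distinct value it sees, this check stops before training if the target column still holds a third category.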


