# Programming for Data Science - 21KHDL1
# Final Project
# Topic:

## Student Information 
| Student ID | Full name     |
| -------- | --------------- |
| 21120570 | Đặng Nguyễn Thanh Tín |
| 21120574 | Nguyễn Minh Trí |
| 21120580 | Trần Thị Kim Trinh |

## Table of contents
- [Overview](#overview)
- [Data Collection](#data-collection)
- [Data Pre-processing and Exploration](#data-pre-processing-and-exploration)
- [Quick view of Data](#quick-view-of-data)
- [Questions](#questions)
- [Reflection](#reflection)
- [References](#references)


# Overview

The Kaggle dataset on drug-related deaths from 2012-2018 provides comprehensive health-related information, encompassing various factors such as drug categories, demographics including gender and age, and the geographical context of fatalities. Despite its age, this data serves as a crucial resource for comprehending the drug issue and proposing preventative measures. Analyzing the dataset can pinpoint trends and factors contributing to fatalities, supporting prevention and treatment efforts. This presents an opportunity to address the public health challenge and formulate effective anti-drug strategies.

Libraries used

In [1]:
# Import what you need here
import numpy as np
import pandas as pd

# Data Collection

**The Connecticut Deaths due to Drugs Dataset** contains information about **5105** people who died due to drug overdose between **2012 and 2018** in Connecticut, US.

The dataset includes data related to the age, race, gender, place of residence of the victims as well as the drugs they overdosed on. This information can be used to understand if drug use is prevalent in a specific area or city, drug use by individuals of different age groups and races as well as the popularity of different types of drugs.

The dataset has **41 columns** and **5105 rows**. The file has the following columns:
1. `ID`: ID of Patient
2. `Date`: Date of death or date the death was reported (see `DateType`)
3. `DateType`: Type of Date in Column 2 [Date of Reporting or Date of Death]
4. `Age`: Age of Patient
5. `Sex`: Sex of Patient
6. `Race`: Race of Patient
7. `ResidenceCity`: City of Residence
8. `ResidenceCounty`: County of Residence
9. `ResidenceState`: State of Residence
10. `DeathCity`: City of Death
11. `DeathCounty`: County of Death
12. `Location`: Location of Death [Hospital or Residence]
13. `LocationifOther`: Location of Death if Not Hospital or Residence
14. `DescriptionofInjury`: Cause of Death
15. `InjuryPlace`: Place of Event that caused Death
16. `InjuryCity`: City of Event that caused Death
17. `InjuryCounty`: County of Event that caused Death
18. `InjuryState`: State of Event that caused Death
19. `COD`: Detailed Cause of Death
20. `OtherSignifican`: Other Significant Injuries that may have led to Death
21. `Heroin`: Drug Found in Body [Y/N]
22. `Cocaine`: Drug Found in Body [Y/N]
23. `Fentanyl`: Drug Found in Body [Y/N]
24. `FentanylAnalogue`: Drug Found in Body [Y/N]
25. `Oxycodone`: Drug Found in Body [Y/N]
26. `Oxymorphone`: Drug Found in Body [Y/N]
27. `Ethanol`: Drug Found in Body [Y/N]
28. `Hydrocodone`: Drug Found in Body [Y/N]
29. `Benzodiazepine`: Drug Found in Body [Y/N]
30. `Methadone`: Drug Found in Body [Y/N]
31. `Amphet`: Drug Found in Body [Y/N]
32. `Tramad`: Drug Found in Body [Y/N]
33. `Morphine_NotHeroin`: Drug Found in Body [Y/N]
34. `Hydromorphone`: Drug Found in Body [Y/N]
35. `Other`: Drug Found in Body [Y/N]
36. `OpiateNOS`: Drug Found in Body [Y/N]
37. `AnyOpioid`: Drug Found in Body [Y/N]
38. `MannerofDeath`: Manner of Death
39. `DeathCityGeo`: City of Death
40. `ResidenceCityGeo`: City of Residence
41. `InjuryCityGeo`: City of Injury
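
The drug columns above are described as `[Y/N]` flags, which are easier to count once they are booleans. The following is a minimal sketch on a made-up toy frame, not on the real file; it assumes that only an explicit `"Y"` means the drug was found and that anything else (including missing values) means it was not. The names `toy_df`, `drug_cols`, `flags` and `drugs_per_case` are illustrative.

```python
import pandas as pd

# Toy data using three of the drug-flag column names (values are invented).
toy_df = pd.DataFrame({
    "Heroin": ["Y", None, "Y"],
    "Fentanyl": [None, "Y", "Y"],
    "Cocaine": [None, None, "Y"],
})

drug_cols = ["Heroin", "Fentanyl", "Cocaine"]

# Assumption: only an explicit "Y" counts as detected; missing values become False.
flags = toy_df[drug_cols].eq("Y")

# Number of flagged drugs per person, and share of cases in which each drug appears.
drugs_per_case = flags.sum(axis=1)
detection_rate = flags.mean()

print(drugs_per_case.tolist())
print(detection_rate)
```

If the real columns contain other markers besides `Y`, the comparison would need to be adapted after inspecting their unique values.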

# Data Pre-processing and Exploration

**Read Data**

In [4]:
# Read the data from ./Accidental Drug Related Deaths in Connecticut-2012-2018 and store it in DrugDeath_df
DrugDeath_df = pd.read_csv('./Data/Accidental_Drug_Related_Deaths_2012-2018.csv')

DrugDeath_df

| | ID | Date | DateType | Age | Sex | Race | ResidenceCity | ResidenceCounty | ResidenceState | DeathCity | ... | Tramad | Morphine_NotHeroin | Hydromorphone | Other | OpiateNOS | AnyOpioid | MannerofDeath | DeathCityGeo | ResidenceCityGeo | InjuryCityGeo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 14-0273 | 06/28/2014 12:00:00 AM | DateReported | | | | | | | | ... | | | | | | | Accident | CT\n(41.575155, -72.738288) | CT\n(41.575155, -72.738288) | CT\n(41.575155, -72.738288) |
| 1 | 13-0102 | 03/21/2013 12:00:00 AM | DateofDeath | 48.0 | Male | Black | NORWALK | | | NORWALK | ... | | | | | | | Accident | Norwalk, CT\n(41.11805, -73.412906) | NORWALK, CT\n(41.11805, -73.412906) | CT\n(41.575155, -72.738288) |
| 2 | 16-0165 | 03/13/2016 12:00:00 AM | DateofDeath | 30.0 | Female | White | SANDY HOOK | FAIRFIELD | CT | DANBURY | ... | | | | | | Y | Accident | Danbury, CT\n(41.393666, -73.451539) | SANDY HOOK, CT\n(41.419998, -73.282501) | |
| 3 | 16-0208 | 03/31/2016 12:00:00 AM | DateofDeath | 23.0 | Male | White | RYE | WESTCHESTER | NY | GREENWICH | ... | | | | | | Y | Accident | Greenwich, CT\n(41.026526, -73.628549) | | |
| 4 | 13-0052 | 02/13/2013 12:00:00 AM | DateofDeath | 22.0 | Male | Asian, Other | FLUSHING | QUEENS | | GREENWICH | ... | | | | | | | Accident | Greenwich, CT\n(41.026526, -73.628549) | | CT\n(41.575155, -72.738288) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5100 | 15-0466 | 09/08/2015 12:00:00 AM | DateReported | 43.0 | Male | White | CHESHIRE | NEW HAVEN | CT | CHESHIRE | ... | | | | | | | Accident | CHESHIRE, CT\n(41.498834, -72.901448) | CHESHIRE, CT\n(41.498834, -72.901448) | CT\n(41.575155, -72.738288) |
| 5101 | 17-0618 | 07/22/2017 12:00:00 AM | DateReported | 21.0 | Male | White | MADISON | NEW HAVEN | CT | NEW HAVEN | ... | | | | | | | Accident | New Haven, CT\n(41.308252, -72.924161) | MADISON, CT\n(41.271447, -72.60086) | CT\n(41.575155, -72.738288) |
| 5102 | 18-0646 | 08/14/2018 12:00:00 AM | DateofDeath | 30.0 | Male | White | LAWRENCEVILLE | TIOGA | PA | DANBURY | ... | Y | | | | | Y | Accident | DANBURY, CT\n(41.393666, -73.451539) | | DANBURY, CT\n(41.393666, -73.451539) |
| 5103 | 14-0124 | 03/16/2014 12:00:00 AM | DateofDeath | 33.0 | Male | White | HARTFORD | | | WINDSOR | ... | | | | | | | Accident | WINDSOR, CT\n(41.852781, -72.64379) | HARTFORD, CT\n(41.765775, -72.673356) | CT\n(41.575155, -72.738288) |
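
The three `...Geo` columns in the preview pack a place name and a coordinate pair into one string, such as `Norwalk, CT\n(41.11805, -73.412906)`. A possible way to split out the coordinates is sketched below on a small hand-made Series; the regular expression and the names `geo`, `pattern` and `coords` are assumptions for illustration, and the sketch assumes every non-missing value ends with a `(lat, lon)` pair as in the rows shown.

```python
import pandas as pd

# Hand-made sample in the same shape as the Geo values shown in the preview.
geo = pd.Series([
    "Norwalk, CT\n(41.11805, -73.412906)",
    "CT\n(41.575155, -72.738288)",
    None,
])

# Capture the two numbers inside the parentheses as named groups.
pattern = r"\((?P<lat>-?\d+(?:\.\d+)?),\s*(?P<lon>-?\d+(?:\.\d+)?)\)"

# str.extract returns one column per named group; missing inputs stay missing.
coords = geo.str.extract(pattern).astype(float)

print(coords)
```

In the preview rows, entries that contain only `CT` all carry the same coordinate pair, so such values may deserve separate handling before any mapping.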


**How many rows and how many columns?**

In [6]:
# Store the number of rows of DrugDeath_df in n_rows and the number of columns in n_cols
n_rows, n_cols = DrugDeath_df.shape

# Print the number of rows and columns of DrugDeath_df
print(f'({n_rows}, {n_cols})')

(5105, 41)


**What is the meaning of each row?**

Each row in this dataset represents information about an individual who passed away due to a drug overdose. Specifically:

<li><b>Demographic Information</b>: Age, gender, race, residential address.</li>
<li><b>Death Information</b>: Date of death, location of death, cause of death, manner of death.</li>
<li><b>Drug-related Information</b>: Presence of specific drugs in the body.</li>

**Are there duplicated rows? If so, remove them.**

In [25]:
# Find fully duplicated rows
duplicate_rows = DrugDeath_df[DrugDeath_df.duplicated()]

# Report whether any duplicated rows exist
if duplicate_rows.shape[0] > 0:
    print("There are duplicated rows.")
    # Drop the duplicated rows
    DrugDeath_df.drop_duplicates(inplace=True)
    print("Duplicates removed.")
else:
    print("There are no duplicated rows.")

There are no duplicated rows.


**Conclusion:** We can see that the dataset doesn't have duplicated rows.
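
The check above only flags rows that are identical in every column. A stricter variant, sketched here on invented toy data, also looks for repeated case identifiers; it assumes `ID` is meant to identify one case per row, and the names `toy_df`, `full_dupes`, `id_dupes` and `deduped` are illustrative.

```python
import pandas as pd

# Toy data: the last two rows share an ID but differ in Age,
# so they are not full-row duplicates.
toy_df = pd.DataFrame({
    "ID": ["14-0273", "13-0102", "13-0102"],
    "Age": [None, 48.0, 49.0],
})

# Rows identical in every column (what duplicated() checks by default).
full_dupes = toy_df.duplicated().sum()

# Rows whose ID already appeared earlier in the frame.
id_dupes = toy_df.duplicated(subset=["ID"]).sum()

# Keep the first row for each ID (one possible policy, not the only one).
deduped = toy_df.drop_duplicates(subset=["ID"], keep="first").reset_index(drop=True)

print(full_dupes, id_dupes, len(deduped))
```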

**What is the meaning of each column?**

- Understanding the data columns is crucial for effective analysis. By examining the column names and the values they hold, we can work out what information each column carries. Where a name is long or unclear, renaming it makes later handling easier.

- Reviewing the column names and their contents in the context of the dataset gives us a comprehensive understanding of the data and tells us which columns, if any, should be renamed before the subsequent analysis steps.

- The columns in this dataset describe deaths due to drug overdoses; each column is listed with its meaning in the Data Collection section.

**What is the current data type of each column? Are there columns with inappropriate data types? If so, convert them.**

In [24]:
# Check the current data type of each column
DrugDeath_df.info()

# Convert columns to a more suitable data type where needed.
# The raw dates look like '06/28/2014 12:00:00 AM' (month first, 12-hour clock with AM/PM)
DrugDeath_df['Date'] = pd.to_datetime(DrugDeath_df['Date'], format='%m/%d/%Y %I:%M:%S %p')

# Check the data types again after the conversion
DrugDeath_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5105 entries, 0 to 5104
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   ID                   5105 non-null   object        
 1   Date                 5103 non-null   datetime64[ns]
 2   DateType             5103 non-null   object        
 3   Age                  5102 non-null   float64       
 4   Sex                  5099 non-null   object        
 5   Race                 5092 non-null   object        
 6   ResidenceCity        4932 non-null   object        
 7   ResidenceCounty      4308 non-null   object        
 8   ResidenceState       3556 non-null   object        
 9   DeathCity            5100 non-null   object        
 10  DeathCounty          4005 non-null   object        
 11  Location             5081 non-null   object        
 12  LocationifOther      590 non-null    object        
 13  DescriptionofInjury  4325 non-nul
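
The raw `Date` strings in the data preview look like `06/28/2014 12:00:00 AM`, that is, month first and a 12-hour clock with an AM/PM marker. The sketch below shows, on a small invented frame, how such strings could be parsed and how two other columns might be given tighter types. The choices of `Int64` for `Age` and `category` for `Sex` are suggestions rather than something the notebook does, and `toy_df` is an illustrative name.

```python
import pandas as pd

# Invented sample mirroring the shape of the real columns.
toy_df = pd.DataFrame({
    "Date": ["06/28/2014 12:00:00 AM", "03/21/2013 12:00:00 AM", None],
    "Age": [None, 48.0, 30.0],
    "Sex": [None, "Male", "Female"],
})

# %m/%d/%Y = month/day/year, %I = 12-hour clock, %p = AM/PM marker.
# errors="coerce" turns unparseable strings into NaT instead of raising.
toy_df["Date"] = pd.to_datetime(
    toy_df["Date"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# Nullable integer type: whole-number ages without losing missing values.
toy_df["Age"] = toy_df["Age"].astype("Int64")

# Low-cardinality text columns can be stored as categories.
toy_df["Sex"] = toy_df["Sex"].astype("category")

print(toy_df.dtypes)
```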

"With each categorical column, how are values distributed?
- What is the percentage of missing values?
- How many different values? Are they abnormal?"

In [27]:
# Collect all categorical (object-typed) columns
categorical_columns = DrugDeath_df.select_dtypes(include=['object']).columns.tolist()

# Analyze each categorical column
for column in categorical_columns:
    print(f"Column: {column}")

    # Value distribution (proportions among non-missing values)
    print(f"Value Distribution:\n{DrugDeath_df[column].value_counts(normalize=True)}")

    # Missing values
    missing_count = DrugDeath_df[column].isnull().sum()
    missing_percentage = (missing_count / len(DrugDeath_df)) * 100
    print(f"Missing Values: {missing_count} ({missing_percentage:.2f}%)")

    # Count the number of unique values
    unique_values_count = DrugDeath_df[column].nunique()
    print(f"Number of Unique Values: {unique_values_count}")

    # Flag the column if the number of unique values is very high
    if unique_values_count > len(DrugDeath_df) / 2:
        print("High number of unique values, might be abnormal.")
    else:
        print("Values seem within an expected range.")

    print("-" * 50)

Column: ID
Value Distribution:
ID
14-0273    0.000196
13-0059    0.000196
14-0353    0.000196
15-0700    0.000196
14-0317    0.000196
             ...   
17-0709    0.000196
16-0178    0.000196
16-0410    0.000196
15-0003    0.000196
16-0637    0.000196
Name: proportion, Length: 5105, dtype: float64
Missing Values: 0 (0.00%)
Number of Unique Values: 5105
High number of unique values, might be abnormal.
--------------------------------------------------
Column: DateType
Value Distribution:
DateType
DateofDeath     0.553008
DateReported    0.446992
Name: proportion, dtype: float64
Missing Values: 2 (0.04%)
Number of Unique Values: 2
Values seem within an expected range.
--------------------------------------------------
Column: Sex
Value Distribution:
Sex
Male       0.739949
Female     0.259855
Unknown    0.000196
Name: proportion, dtype: float64
Missing Values: 6 (0.12%)
Number of Unique Values: 3
Values seem within an expected range.
--------------------------------------------------
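
The per-column printout above can be condensed into one summary table, which is easier to scan when there are many columns. This is a minimal sketch on invented toy data; `toy_df`, `summary` and `share_with_missing` are illustrative names. It also shows that `value_counts(normalize=True)` leaves missing values out of the proportions unless `dropna=False` is passed.

```python
import pandas as pd

# Invented sample with one missing value in "Sex".
toy_df = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", None],
    "DateType": ["DateofDeath", "DateReported", "DateofDeath", "DateofDeath"],
})

# One row per column: percentage of missing values and number of distinct values.
summary = pd.DataFrame({
    "missing_pct": toy_df.isna().mean() * 100,
    "n_unique": toy_df.nunique(),
})

# Proportions that treat missing values as their own group.
share_with_missing = toy_df["Sex"].value_counts(normalize=True, dropna=False)

print(summary)
print(share_with_missing)
```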

# Quick view of Data

# Questions

# Reflection

# References