In [3]:
# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Phase 1: Data Understanding

In this phase, the objective is to explore and understand the dataset before performing any cleaning, analysis, or modeling.  
This step helps in identifying the structure of the data, data types, missing values, and potential data quality issues.

Understanding the dataset thoroughly is crucial as it forms the foundation for all further data preprocessing and analysis tasks.


In [37]:
# Loading the Dataset

df = pd.read_csv(r'C:\Users\nithi\OneDrive\Desktop\EDA PROJECT\archive\SuicideChina.csv')

In [38]:
# Identify the number of rows and columns in the dataset

df.shape

(2571, 12)

In [30]:
# View Sample data

df.head()

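|   | Unnamed: 0 | Person_ID | Hospitalised | Died | Urban | Year | Month | Sex | Age | Education | Occupation | method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | yes | no | no | 2010 | 12 | female | 39 | Secondary | household | Other poison |
| 1 | 2 | 2 | no | yes | no | 2009 | 3 | male | 83 | primary | farming | Hanging |
| 2 | 3 | 3 | no | yes | no | 2010 | 2 | male | 60 | primary | farming | Hanging |
| 3 | 4 | 4 | no | yes | no | 2011 | 1 | male | 73 | primary | farming | Hanging |
| 4 | 5 | 5 | yes | no | no | 2009 | 8 | male | 51 | Secondary | farming | Pesticide |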


## Data Insights

#### This dataset represents suicide attempt cases in China, capturing demographic, social, and outcome-related information of individuals.

#### The dataset contains 12 columns:
- Unnamed: 0 – 1-based row counter carried over from the CSV; it matches Person_ID in the sample rows
- Person_ID – Unique identifier assigned to each individual
- Hospitalised – Indicates whether the person was hospitalized after the attempt (yes / no)
- Died – Shows whether the suicide attempt resulted in death (yes / no)
- Urban – Indicates whether the individual lived in an urban area (yes), a rural area (no), or an unknown area
- Year – Year in which the suicide attempt was recorded
- Month – Month in which the suicide attempt occurred
- Sex – Sex of the individual
- Age – Age of the individual at the time of the incident
- Education – Educational qualification of the individual
- Occupation – Occupation of the individual
- method – Method used for the suicide attempt (the only column name written in lowercase)
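
The sample rows suggest that Unnamed: 0 simply repeats Person_ID. A minimal sketch of how that could be checked before dropping the column is shown here; the `sample` frame is a hand-typed copy of three columns from the `df.head()` output, not the real dataset, and the assumption that the two columns match on every row still has to be confirmed on the full `df`.

```python
import pandas as pd

# Toy frame copied from the df.head() output above (illustrative only).
sample = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4, 5],
    "Person_ID": [1, 2, 3, 4, 5],
    "Age": [39, 83, 60, 73, 51],
})

# True only if the counter equals the identifier on every row.
is_duplicate = bool((sample["Unnamed: 0"] == sample["Person_ID"]).all())

# Drop the counter only when it carries no extra information.
if is_duplicate:
    sample = sample.drop(columns="Unnamed: 0")

print(is_duplicate, list(sample.columns))
```

Running the same comparison on the full `df` would show whether the column is safe to remove in the cleaning phase.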


In [15]:
# Inspect column data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2571 entries, 0 to 2570
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    2571 non-null   int64 
 1   Person_ID     2571 non-null   int64 
 2   Hospitalised  2571 non-null   object
 3   Died          2571 non-null   object
 4   Urban         2571 non-null   object
 5   Year          2571 non-null   int64 
 6   Month         2571 non-null   int64 
 7   Sex           2571 non-null   object
 8   Age           2571 non-null   int64 
 9   Education     2571 non-null   object
 10  Occupation    2571 non-null   object
 11  method        2571 non-null   object
dtypes: int64(5), object(7)
memory usage: 241.2+ KB


## Data Insights:

#### The dataset contains 2,571 rows and 12 columns. Seven columns are stored as object (categorical data) and five as int64 (numerical data).

### The categorical columns are:
- Hospitalised (object)
- Died (object)
- Urban (object)
- Sex (object)
- Education (object)
- Occupation (object)
- method (object)

### The numerical columns are:
- Unnamed: 0 (int64)
- Person_ID (int64)
- Year (int64)
- Month (int64)
- Age (int64)
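
Because the object columns hold a small set of repeated labels, they could be stored as pandas `category` dtype. The sketch below is one possible way to do that, using a toy frame typed from the sample rows rather than the real data; selecting columns with `exclude="number"` is an assumption that every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Toy frame typed from the sample rows (illustrative only).
sample = pd.DataFrame({
    "Sex": ["female", "male", "male", "male", "male"],
    "Died": ["no", "yes", "yes", "yes", "no"],
    "Age": [39, 83, 60, 73, 51],
})

# Treat every non-numeric column as categorical.
label_cols = sample.select_dtypes(exclude="number").columns
typed = sample.astype({col: "category" for col in label_cols})

print(typed.dtypes)
```

The numerical columns are left untouched; only the label columns change dtype.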


In [17]:
# Summary statistics for the numerical columns

df.describe()

|       | Unnamed: 0 | Person_ID | Year | Month | Age |
|---|---|---|---|---|---|
| count | 2571.0 | 2571.0 | 2571.0 | 2571.0 | 2571.0 |
| mean | 1286.0 | 1286.0 | 2010.045508 | 6.298327 | 52.630883 |
| std | 742.328095 | 742.328095 | 0.791412 | 3.202515 | 19.783878 |
| min | 1.0 | 1.0 | 2009.0 | 1.0 | 12.0 |
| 25% | 643.5 | 643.5 | 2009.0 | 4.0 | 37.0 |
| 50% | 1286.0 | 1286.0 | 2010.0 | 6.0 | 53.0 |
| 75% | 1928.5 | 1928.5 | 2011.0 | 9.0 | 69.0 |
| max | 2571.0 | 2571.0 | 2011.0 | 12.0 | 100.0 |


### Data Insights:
df.describe() returns summary statistics for the 5 numerical columns; the values below are rounded to two decimal places. Unnamed: 0 and Person_ID are a row counter and an identifier, so their statistics have no analytical meaning.

#### Unnamed: 0
- count: 2571
- mean: 1286.00
- std: 742.33
- min: 1.00
- 25%: 643.50
- 50%: 1286.00
- 75%: 1928.50
- max: 2571.00

#### Person_ID
- count: 2571
- mean: 1286.00
- std: 742.33
- min: 1.00
- 25%: 643.50
- 50%: 1286.00
- 75%: 1928.50
- max: 2571.00

#### Year
- count: 2571
- mean: 2010.05
- std: 0.79
- min: 2009
- 25%: 2009
- 50%: 2010
- 75%: 2011
- max: 2011

#### Month
- count: 2571
- mean: 6.30
- std: 3.20
- min: 1
- 25%: 4
- 50%: 6
- 75%: 9
- max: 12

#### Age
- count: 2571
- mean: 52.63
- std: 19.78
- min: 12
- 25%: 37
- 50%: 53
- 75%: 69
- max: 100
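
The Age quartiles can be used for a quick outlier check. The sketch below applies the common 1.5 × IQR rule to the values copied from the `df.describe()` output; the choice of that rule is an assumption made here, not something the dataset prescribes.

```python
# Quartiles and extremes copied from the df.describe() output for Age.
q1, q3 = 37.0, 69.0
age_min, age_max = 12.0, 100.0

iqr = q3 - q1                   # interquartile range: 32.0
lower_fence = q1 - 1.5 * iqr    # -11.0
upper_fence = q3 + 1.5 * iqr    # 117.0

# An outlier exists only if an extreme falls outside the fences.
has_outliers = age_min < lower_fence or age_max > upper_fence

print(lower_fence, upper_fence, has_outliers)  # -11.0 117.0 False
```

Under this rule the minimum (12) and maximum (100) both sit inside the fences, so Age shows no statistical outliers.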


In [19]:
# Missing values

df.isnull().sum()

Unnamed: 0      0
Person_ID       0
Hospitalised    0
Died            0
Urban           0
Year            0
Month           0
Sex             0
Age             0
Education       0
Occupation      0
method          0
dtype: int64

### Data insights:

#### There are no missing values in the dataset. All columns contain complete data.
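
`isnull()` only counts true nulls (NaN / None). A placeholder label such as "unknown" is an ordinary string and is not counted. The sketch below shows one way such placeholders could be surfaced, using a toy frame rather than the real data; treating "unknown" as missing is an assumption that should be decided during cleaning.

```python
import pandas as pd

# Toy frame with placeholder labels (illustrative only).
sample = pd.DataFrame({
    "Urban": ["no", "yes", "unknown", "no"],
    "Education": ["Secondary", "unknown", "primary", "unknown"],
})

# isnull() sees no missing values because "unknown" is a normal string.
explicit_missing = sample.isnull().sum()

# mask() replaces matching cells with NaN, making the placeholders countable.
masked = sample.mask(sample == "unknown")
implicit_missing = masked.isnull().sum()

print(explicit_missing)
print(implicit_missing)
```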

In [24]:
# Understand categorical variables

for col in df.select_dtypes(include='object').columns:
    print(f"Column: {col}")
    print(df[col].value_counts())
    print("-" * 40)

Column: Hospitalised
Hospitalised
yes    1553
no     1018
Name: count, dtype: int64
----------------------------------------
Column: Died
Died
no     1315
yes    1256
Name: count, dtype: int64
----------------------------------------
Column: Urban
Urban
no         2213
yes         277
unknown      81
Name: count, dtype: int64
----------------------------------------
Column: Sex
Sex
female    1328
male      1243
Name: count, dtype: int64
----------------------------------------
Column: Education
Education
Secondary    1280
primary       659
iliterate     533
unknown        80
Tertiary       19
Name: count, dtype: int64
----------------------------------------
Column: Occupation
Occupation
farming             2032
household            248
others/unknown       156
professional          37
student               35
unemployed            30
business/service      21
worker                 6
others                 3
retiree                3
Name: count, dtype: int64
---------------------------

### Data Insights:

#### The loop prints the unique values and their counts for each categorical column.

#### Hospitalised:
- Yes: 1,553
- No: 1,018

#### Died:
- No: 1,315
- Yes: 1,256

#### Urban:
- No (Rural): 2,213
- Yes (Urban): 277
- Unknown: 81

#### Sex:
- Female: 1,328
- Male: 1,243

#### Education:
- Secondary: 1,280
- Primary: 659
- Illiterate: 533
- Unknown: 80
- Tertiary: 19

#### Occupation:
- Farming: 2,032
- Household: 248
- Others / Unknown: 156
- Professional: 37
- Student: 35
- Unemployed: 30
- Business / Service: 21
- Worker: 6
- Others: 3
- Retiree: 3

#### Method:
- Pesticide: 1,768
- Hanging: 431
- Cutting: 155
- Jumping: 89
- Others: 128
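
Raw counts are easier to compare as percentages. The sketch below converts the Urban counts printed by `value_counts()` into shares of the total; the counts are typed in by hand from that output. On the real `df`, `df["Urban"].value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output for Urban.
urban_counts = pd.Series({"no": 2213, "yes": 277, "unknown": 81})

# Share of each label as a percentage of all 2,571 rows.
urban_share = (urban_counts / urban_counts.sum() * 100).round(1)

print(urban_share)
```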


## Questions to Answer in Report

### 1. What is the shape of your dataset (rows and columns)?

**Ans:**
The dataset contains **2,571 rows and 12 columns**.

--------------------------------------------------

### 2. Which columns contain missing values?

**Ans:**
There are **no missing values** in the dataset.

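- All 12 columns report **0 null values**
- Urban, Education, and Occupation contain "unknown" placeholder categories, which are not counted as nulls
- No null imputation is required, but the placeholder categories need a handling decision during cleaning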

--------------------------------------------------

### 3. What are the data types of each column?

**Ans:**

- **object:**
  - Hospitalised
  - Died
  - Urban
  - Sex
  - Education
  - Occupation
  - method

- **int64:**
  - Unnamed: 0
  - Person_ID
  - Year
  - Month
  - Age

--------------------------------------------------

### 4. What is the distribution of key fields like:

### Gender / Category columns

- **Hospitalised:**
  - Yes: 1,553
  - No: 1,018

- **Died:**
  - No: 1,315
  - Yes: 1,256

- **Urban:**
  - No (Rural): 2,213
  - Yes (Urban): 277
  - Unknown: 81

- **Sex:**
  - Female: 1,328
  - Male: 1,243

- **Education:**
  - Secondary: 1,280
  - Primary: 659
  - Illiterate: 533
  - Unknown: 80
  - Tertiary: 19

- **Occupation:**
  - Farming: 2,032
  - Household: 248
  - Others / Unknown: 156
  - Professional: 37
  - Student: 35
  - Unemployed: 30
  - Business / Service: 21
  - Worker: 6
  - Others: 3
  - Retiree: 3

- **Method:**
  - Pesticide: 1,768
  - Hanging: 431
  - Cutting: 155
  - Jumping: 89
  - Others: 128

--------------------------------------------------

### Target variable (if applicable)

- There is **no explicit target variable** in the dataset.
- However, **Died** can be considered as a target variable, as it indicates whether the suicide attempt resulted in death (Yes / No).
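
If Died is used as a target, it would typically be encoded as a binary flag first. The sketch below is one way to do that on a toy frame typed from the sample rows; the column name `died_flag` is hypothetical. In the full data, the counts above imply a fatality share of 1,256 / 2,571, or about 48.9%.

```python
import pandas as pd

# Toy frame typed from the df.head() output (illustrative only).
sample = pd.DataFrame({
    "Died": ["no", "yes", "yes", "yes", "no"],
    "method": ["Other poison", "Hanging", "Hanging", "Hanging", "Pesticide"],
})

# died_flag is a hypothetical name for the encoded target: yes -> 1, no -> 0.
sample["died_flag"] = sample["Died"].map({"yes": 1, "no": 0})

# The mean of a 0/1 flag is the share of fatal cases.
fatality_rate = sample["died_flag"].mean()
by_method = sample.groupby("method")["died_flag"].mean()

print(fatality_rate)
print(by_method)
```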

--------------------------------------------------

### Any important identifiers or scores

- **Person_ID:** Unique identifier for each individual case
- **Unnamed: 0:** Index column representing row numbers

--------------------------------------------------

### 5. Are there any data quality issues like typos, outliers, or inconsistent formatting?

**Ans:**

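- There are no null values in the dataset, but Urban (81), Education (80), and Occupation (156 "others/unknown") use placeholder categories that act as implicit missing values
- Category labels are inconsistently formatted: Education mixes capitalized and lowercase values (Secondary, Tertiary, primary) and contains the typo "iliterate"
- The column name method is lowercase while every other column name is capitalized
- Occupation has two overlapping categories, "others/unknown" (156) and "others" (3)
- Unnamed: 0 has the same summary statistics as Person_ID and appears to be a redundant copy
- **Age** ranges from 12 to 100; both extremes lie within the 1.5 × IQR fences (Q1 = 37, Q3 = 69, fences at -11 and 117), so no value is flagged as an outlier by that rule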
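
The label inconsistencies in Education could be handled with a small normalization step. The sketch below is a minimal example on the five labels printed by `value_counts()`; lowercasing every label and correcting "iliterate" to "illiterate" are choices assumed here, not steps the notebook has already performed.

```python
import pandas as pd

# The five Education labels as they appear in the value_counts() output.
education = pd.Series(["Secondary", "primary", "iliterate", "unknown", "Tertiary"])

# Trim whitespace, unify case, then fix the known misspelling.
cleaned = (
    education.str.strip()
    .str.lower()
    .replace({"iliterate": "illiterate"})
)

print(cleaned.tolist())
```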

-------------------------------------------------

## Data Quality Observations

Based on the exploratory analysis, the following data quality issues were identified:

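- No column contains null values, so no null imputation is required; the "unknown" categories in Urban (81), Education (80), and Occupation (156 "others/unknown") need a handling decision instead.
- Some categorical variables are heavily imbalanced, which may affect model performance: 2,213 of 2,571 cases are rural and 2,032 are in farming, while only 19 have tertiary education.
- Age spans 12 to 100; both extremes fall within the 1.5 × IQR fences (-11 to 117), so no value is flagged as an outlier, although the distribution is still worth a visual check.
- Category labels require standardization: Education mixes capitalization (Secondary, primary) and contains the typo "iliterate", and the column name method is lowercase unlike the others.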

## Summary

- In this phase, an initial exploration of the dataset was performed to understand its structure and quality.  
- The dataset consists of both numerical and categorical features (5 int64 and 7 object columns), and no column contains null values.  
- Key exploratory steps such as viewing sample records, inspecting data types, analyzing summary statistics, and examining categorical distributions were completed.

- The analysis revealed "unknown" placeholder categories, inconsistent category labels, and class imbalance in certain categorical variables; no null values were found.  
- These observations highlight the need for proper data cleaning, transformation, and preprocessing, which will be handled in the next phase.

- Overall, this data understanding phase provides a strong foundation for effective data preparation and further analytical modeling.
