## **Python for Data Science: Data Exploration and Analysis**

### What is Data?


- Information that we collect and analyze.
- It can be stored in different formats such as an Excel sheet, CSV file, or text file.

### Structure of a Dataset

- **Rows**: The individual entries (records) in the dataset.
- **Columns**: The features or attributes recorded for each entry.



### How Do We Read and Work with a Large Dataset?

When working with large datasets, we often receive a dataset along with a **problem statement** that we need to solve. Before jumping into solving the problem, we need to first **understand the dataset**.

#### Key Questions to Ask:

1. **If I give you a problem statement and a dataset, what do you do with it?**  
   - Identify the goal: What are we trying to achieve?  
   - Understand what kind of data we have.  

2. **How do you go through your dataset to understand what you are working with?**  
   - Read the dataset into Python.  
   - Explore the structure of the data.  
   - Check for missing values, duplicates, and data types.  
   - Perform analysis to get insights.  
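The checklist above can be sketched as a single helper. This is only an illustration: the function name `first_look` and the tiny in-memory CSV are assumptions made for the example, not part of the accident dataset.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a real file (values are made up).
csv_text = """Age,Gender,Speed_of_Impact,Survived
56,Female,27.0,1
69,Female,,1
46,Male,46.0,0
46,Male,46.0,0
"""

def first_look(df):
    """Collect the basic facts worth checking before any analysis."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }

sample = pd.read_csv(io.StringIO(csv_text))
summary = first_look(sample)
print(summary)
```

Each key in the result corresponds to one of the checks listed above: structure, data types, missing values, and duplicates.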


### Introduction to Pandas

Pandas is a powerful Python library that makes data manipulation and analysis simple.

In [7]:
import pandas as pd

#### What do we use pandas for?

- Given a dataset, you need to be able to read it into Python and analyse it. pandas provides the tools for both.


---
#### About Dataset

This dataset contains detailed records of simulated road accident data, focusing on factors influencing survival outcomes. The dataset includes demographic, behavioral, and situational attributes, providing valuable insights into how various factors impact the survival probability during road accidents.

---

#### *How to read your Dataset?*

- **Step 1 : Load your dataset**

In [8]:
# This is a CSV file, so we use read_csv
data = pd.read_csv('accident.csv')

# For other data formats, use the matching reader

# For Excel files
# data = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# For JSON files
# data = pd.read_json("data.json")

- **Step 2 : Read the first 5 rows of your data to get an idea of what you are working on**

In [9]:
data.head()

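|   | Age | Gender | Speed_of_Impact | Helmet_Used | Seatbelt_Used | Survived |
|---|-----|--------|-----------------|-------------|---------------|----------|
| 0 | 56 | Female | 27.0 | No | No | 1 |
| 1 | 69 | Female | 46.0 | No | Yes | 1 |
| 2 | 46 | Male | 46.0 | Yes | Yes | 0 |
| 3 | 32 | Male | 117.0 | No | Yes | 0 |
| 4 | 60 | Female | 40.0 | Yes | Yes | 0 |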


This gives you an idea of the kind of data you are working with. When the problem statement does not give clear details about your attributes, this is the best place to start looking.
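`head()` returns five rows by default, and it accepts a row count. A minimal sketch of a few ways to peek at rows, using a small made-up DataFrame named `demo` (an assumption for illustration, not the accident dataset):

```python
import pandas as pd

# Made-up values, used only to demonstrate the calls.
demo = pd.DataFrame({
    "Age": [56, 69, 46, 32, 60, 25],
    "Survived": [1, 1, 0, 0, 0, 1],
})

first_three = demo.head(3)                     # first 3 rows
last_two = demo.tail(2)                        # last 2 rows
random_two = demo.sample(n=2, random_state=0)  # 2 random rows, reproducible

print(first_three)
print(last_two)
print(random_two)
```

Looking at the end of the file or at a random sample can reveal problems that the first few rows hide.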



- **Step 3 : Check the size of your dataset**

In [10]:
data.shape

(200, 6)

The shape is returned as `(rows, columns)`: here, 200 records and 6 attributes. This tells you how large your sample is and how many attributes may affect the problem you are trying to solve.
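Because `shape` is a plain tuple, it can be unpacked into separate variables. A short sketch on a made-up DataFrame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up values, used only to demonstrate the calls.
demo = pd.DataFrame({"Age": [56, 69, 46], "Survived": [1, 1, 0]})

n_rows, n_cols = demo.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame gives the number of rows
print(len(demo))
```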

- **Step 4 : List your columns (optional)**

In [11]:
data.columns

Index(['Age', 'Gender', 'Speed_of_Impact', 'Helmet_Used', 'Seatbelt_Used',
       'Survived'],
      dtype='object')

This shows you the attributes/features in your dataset.

- **Step 5 : Get information about data types and missing values**

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Age              200 non-null    int64  
 1   Gender           199 non-null    object 
 2   Speed_of_Impact  197 non-null    float64
 3   Helmet_Used      200 non-null    object 
 4   Seatbelt_Used    200 non-null    object 
 5   Survived         200 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 9.5+ KB


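At this stage, we want to identify the attributes (columns) we are working with, see their data types, and check for any missing values. Here, `Gender` has 199 non-null entries and `Speed_of_Impact` has 197 out of 200, so both columns contain missing values.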
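`info()` prints its report rather than returning it. If you want the same facts as values you can work with, one possible approach is sketched below on a made-up DataFrame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up values; None marks a missing entry.
demo = pd.DataFrame({
    "Age": [56, 69, 46],
    "Gender": ["Female", None, "Male"],
    "Speed_of_Impact": [27.0, None, 46.0],
})

# Split the columns by kind of data
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()

# Non-null count per column, like the "Non-Null Count" column of info()
non_null = demo.notnull().sum()

print(numeric_cols)
print(other_cols)
print(non_null)
```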

- **Step 6 : Statistical summary for numerical columns**

In [13]:
data.describe()

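|       | Age | Speed_of_Impact | Survived |
|-------|-----|-----------------|----------|
| count | 200.0 | 197.0 | 200.0 |
| mean | 43.425 | 70.441624 | 0.505 |
| std | 14.94191 | 30.125298 | 0.50123 |
| min | 18.0 | 20.0 | 0.0 |
| 25% | 31.0 | 43.0 | 0.0 |
| 50% | 43.5 | 71.0 | 1.0 |
| 75% | 56.0 | 95.0 | 1.0 |
| max | 69.0 | 119.0 | 1.0 |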


This gives a statistical description of your numeric columns. It is only useful if your data contains numeric data types.
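By default, `describe()` on a mixed DataFrame summarises only the numeric columns. For a text column such as `Gender`, counting how often each value appears is a common alternative. A minimal sketch on made-up data (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up values, used only to demonstrate the calls.
demo = pd.DataFrame({
    "Age": [56, 69, 46],
    "Gender": ["Female", "Male", "Male"],
})

# How many times each category appears
gender_counts = demo["Gender"].value_counts()

# describe() on a single text column reports count, unique, top and freq
gender_summary = demo["Gender"].describe()

print(gender_counts)
print(gender_summary)
```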

### *How to explore and analyse your data with pandas*

After loading the data, the next step is to explore it to understand its structure and quality.

- **Step 1 : Identifying Data Types and Missing Values**

In [14]:
data.isnull().sum()

Age                0
Gender             1
Speed_of_Impact    3
Helmet_Used        0
Seatbelt_Used      0
Survived           0
dtype: int64

1. We use **isnull()** to mark every cell as `True` if it is missing and `False` otherwise.  
2. We then use **sum()** to add up the `True` values, which gives the number of missing entries in each column. 
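The same idea extends to percentages and to finding the affected rows. A short sketch on a made-up DataFrame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up values; None marks a missing entry.
demo = pd.DataFrame({
    "Age": [56, 69, 46, 32],
    "Gender": ["Female", None, "Male", "Male"],
    "Speed_of_Impact": [27.0, 46.0, None, 117.0],
})

# mean() of True/False values is the fraction of True, so this is
# the percentage of missing entries per column
missing_pct = demo.isnull().mean() * 100

# Rows that have at least one missing value
rows_with_missing = demo[demo.isnull().any(axis=1)]

print(missing_pct)
print(rows_with_missing)
```

Knowing the share of missing values helps when deciding later whether to replace or remove them.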

- **Step 2 : Cleaning and Preprocessing**

Sometimes, datasets contain repeated entries, which can lead to inaccurate analysis.  
In such cases, we need to **identify and remove duplicate records** to ensure data quality.

In [15]:
# Check for duplicate rows
print(data.duplicated().sum())

# Remove duplicate rows
data = data.drop_duplicates()

# Verify that duplicates have been removed
print(data.duplicated().sum())

0
0
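The accident dataset has no duplicates, so both counts are 0. To see what the calls do when duplicates exist, here is a sketch on a made-up DataFrame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up values; the first two rows are identical.
demo = pd.DataFrame({
    "Age": [46, 46, 32],
    "Gender": ["Male", "Male", "Male"],
    "Survived": [0, 0, 0],
})

# By default the first copy is not counted as a duplicate
n_duplicates = demo.duplicated().sum()

# keep=False marks every copy, which is handy for inspecting them
all_copies = demo[demo.duplicated(keep=False)]

# Drop the repeats and renumber the rows
deduped = demo.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(all_copies)
print(deduped)
```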


- **Step 3 : Handle Missing Data**

Missing data can impact analysis, so we need to **handle it properly**. Depending on the situation, we can either **replace** or **remove** missing values.

**1. Replacing Missing Values**

You can replace missing values using different strategies:  

- **Fill with a specific value (e.g., 0):**  
  ```python
  data_filled = data.fillna(0)
  ```

**2. Removing Missing Values**

In [16]:
# Here we remove the rows that contain null values instead of replacing them
data_cleaned = data.dropna()
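Filling with 0 is not always sensible (a speed of 0 would distort the statistics, for example). One common alternative, shown here only as a sketch, is to fill numeric columns with the median and text columns with the most frequent value. The DataFrame `demo` and its values are made up for illustration:

```python
import pandas as pd

# Made-up values; None marks a missing entry.
demo = pd.DataFrame({
    "Gender": ["Female", None, "Male", "Male"],
    "Speed_of_Impact": [27.0, None, 46.0, 117.0],
})

filled = demo.copy()

# Numeric column: replace missing values with the median
speed_median = filled["Speed_of_Impact"].median()
filled["Speed_of_Impact"] = filled["Speed_of_Impact"].fillna(speed_median)

# Text column: replace missing values with the most frequent value
most_common_gender = filled["Gender"].mode()[0]
filled["Gender"] = filled["Gender"].fillna(most_common_gender)

# For comparison: removing the incomplete rows instead
dropped = demo.dropna()

print(filled)
print(dropped)
```

Replacing keeps every row but introduces estimated values, while removing keeps only observed values at the cost of a smaller sample.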