# 📊 Dataset Exploration: Understand, Identify Problems & Plan Next Steps

In this notebook, we'll perform basic data exploration using a JSON dataset. The process includes:

1. ✅ Understanding the dataset  
2. ❗ Identifying potential problems  
3. 📌 Planning next steps

We'll also explore:
- Viewing rows using `head()` and `tail()`
- Checking data types
- Counting missing values
- Checking memory usage and structure


In [None]:
# 🔃 Step 1: Importing required library
import pandas as pd


## 📂 Step 2: Load the Dataset

We are loading a JSON dataset located at `pract/sample_Data.json`.


In [None]:
# 🔍 Load JSON file into DataFrame
df = pd.read_json("pract/sample_Data.json")
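
The structure of `pract/sample_Data.json` is not shown here, so the following is only a minimal sketch of how `pd.read_json` behaves. It assumes the file holds a JSON array of records; the `json_text` string and its `name`/`score` fields are hypothetical stand-ins for the real file.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the file contents: a JSON array of records.
json_text = '[{"name": "A", "score": 10}, {"name": "B", "score": null}]'

# read_json accepts a path or a file-like object; StringIO wraps the string.
sample_df = pd.read_json(StringIO(json_text))

# Each JSON object becomes a row; JSON null becomes a missing value (NaN).
print(sample_df)
```

If the real file has a different layout (for example, a dict of columns), the `orient` parameter of `read_json` may need to be set accordingly.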


## ✅ Step 3: Understand the Dataset

We will:
- View top and bottom rows
- Check dimensions
- View column names
- Get info about data types and memory usage


In [None]:
# 📌 Show first 7 rows (initial inspection)
print("🔹 First 7 rows:")
print(df.head(7))  # head(n): Returns first n rows

# 📌 Show last 7 rows
print("\n🔹 Last 7 rows:")
print(df.tail(7))  # tail(n): Returns last n rows

# 📌 Default head() and tail(): first & last 5 rows
print("\n🔹 First 5 rows (default head):")
print(df.head())

print("\n🔹 Last 5 rows (default tail):")
print(df.tail())
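
The dimension and column-name checks listed for this step could be sketched as follows. The `demo_df` frame is a hypothetical stand-in so the snippet runs on its own; in the notebook the same attributes would be read from `df`.

```python
import pandas as pd

# Hypothetical stand-in data; in the notebook, `df` comes from read_json.
demo_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "score": [10.0, None, 7.5],
})

n_rows, n_cols = demo_df.shape         # shape is a (rows, columns) tuple
column_names = list(demo_df.columns)   # column labels as a plain list

print(f"Rows: {n_rows}, Columns: {n_cols}")
print("Columns:", column_names)
print(demo_df.dtypes)                  # data type of each column
```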


## ❗ Step 4: Identify Potential Problems

We will:
- Check number of rows and columns
- Identify column names
- Look at data types
- Find missing (null) values


In [None]:
# 🔢 Data types, non-null values, memory usage
print("\nℹ️ DataFrame Info:")
df.info()
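
`info()` reports non-null counts, but counting missing values directly is often easier to read. A minimal sketch, again using a hypothetical `demo_df` rather than the real dataset:

```python
import pandas as pd

# Hypothetical stand-in data with one missing value in each column.
demo_df = pd.DataFrame({
    "name": ["A", "B", None],
    "score": [10.0, None, 7.5],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = demo_df.isnull().sum()
print(missing_per_column)

# The mean of the True/False mask is the share of missing values per column.
missing_pct = demo_df.isnull().mean() * 100
print(missing_pct.round(1))

# deep=True also counts the memory held by the string objects themselves.
memory_bytes = demo_df.memory_usage(deep=True)
print(memory_bytes)
```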

# Note: Step 5 was skipped by mistake
## 📈 Step 6: Summary Statistics using `describe()`

The `describe()` method gives us basic statistical info for all **numerical columns**, such as:

- `count`: Number of non-null entries  
- `mean`: Average value  
- `std`: Standard deviation (spread); the lower the std, the more consistent the data  
- `min`, `max`: Minimum and maximum values  
- `25%`, `50%`, `75%`: Percentiles (useful to understand distribution)


# Very important
# 🧠 What do 25%, 50%, and 75% mean?

They are the 25th, 50th, and 75th **percentiles**. Together they are called **quartiles** because they split the sorted data into four parts, and they **help describe the spread/distribution of your data**.

---

### 📊 Think of your data as a sorted list of numbers.

Suppose you have a sorted list of ages: 
[10, 12, 15, 18, 21, 24, 28, 30, 35, 40]

---

#### 🔸 25% → First Quartile (Q1)

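- **Meaning**: About 25% of the values are **less than or equal to this value**.
- In our list: taking the nearest data point, the 25% point = 3rd value = **15**. (`describe()` interpolates between the 3rd and 4th values and reports **15.75**.)
- So, roughly a quarter of the people are aged **15 or younger**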

---

#### 🔸 50% → Second Quartile (Median)

- **Meaning**: Half of the values (50%) are **less than or equal to this value**
- This is also known as the **median**
- In our list: Median = average of 5th and 6th values = (21+24)/2 = **22.5**
- So, 50% of people are aged **22.5 or younger**

---

#### 🔸 75% → Third Quartile (Q3)

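- **Meaning**: About 75% of the values are **less than or equal to this value**
- In our list: taking the nearest data point, the 75% point = 8th value = **30**. (`describe()` interpolates between the 7th and 8th values and reports **29.5**.)
- So, roughly three quarters of the people are aged **30 or younger**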
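
To check the quartiles of the age list with pandas, a sketch along these lines should work. `Series.quantile` uses linear interpolation by default (the same values `describe()` reports), while `interpolation="nearest"` picks the closest actual data point.

```python
import pandas as pd

ages = pd.Series([10, 12, 15, 18, 21, 24, 28, 30, 35, 40])

# Default (linear) interpolation: the values describe() would show.
quartiles = ages.quantile([0.25, 0.50, 0.75])
print(quartiles)

# "nearest" snaps each quartile to the closest value present in the data.
nearest = ages.quantile([0.25, 0.75], interpolation="nearest")
print(nearest)
```

With linear interpolation the 25% mark sits a quarter of the way between 15 and 18, which is why it is not a value from the list itself.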

---

### 📦 Visual Representation:
|--------|--------|--------|--------|
 Min     25%      50%      75%     Max
         Q1       Q2       Q3


- This kind of division is also used in **boxplots** 📦 to visualize spread and detect outliers.

---

### ✅ Why it matters:

These values help you understand:
- If your data is **skewed** (leaning toward high or low values)
- Whether there are **outliers** (values that are way too high or low)
- How **spread out** your data is
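
One common way to turn the quartiles into an outlier check is the 1.5 × IQR rule that boxplots conventionally use. The sketch below is an illustration, not part of the original notebook: it reuses the age list and appends one hypothetical extreme value (95) so there is something to detect.

```python
import pandas as pd

# The ages from the example plus one hypothetical extreme value (95).
ages = pd.Series([10, 12, 15, 18, 21, 24, 28, 30, 35, 40, 95])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1                      # interquartile range: spread of the middle 50%

# Conventional boxplot fences: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())

# A mean well above the median hints at a right-skewed distribution.
print("mean:", round(ages.mean(), 2), "median:", ages.median())
```

The 1.5 multiplier is a convention rather than a law; a stricter or looser threshold may suit a particular dataset better.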




# 📊 Employee Data Analysis using Pandas

In this notebook, we'll:

- Create a sample dataset of employees
- Display the dataset
- Use `.describe()` to get statistical insights


In [None]:
# 🔃 Step 1: Import pandas library
import pandas as pd

## 🧾 Step 2: Create the Dataset

We are creating a simple dataset with the following columns:
- Name
- Age
- Salary
- Performance Score


In [None]:
# 🧱 Creating sample employee data
data = {
    "Name": ['Ram', 'Shyam', 'Ghanshyam', 'Dhanshyam', 'Aditi', 'Jagdish','Raj', 'Simran'],
    "Age" : [28, 32, 47, 57, 17, 27, 77, 25],
    "Salary": [5000, 6000, 45000, 5200, 4900, 7000, 9000, 17000],
    "Performance Score": [43, 71, 26, 59, 84, 38, 67, 22]  
}

# 📋 Creating DataFrame from dictionary
df = pd.DataFrame(data)

## 🔍 Step 3: View the Created Dataset
We'll print the full DataFrame to inspect our data.


In [None]:
# Display the full dataset
print("📌 Created Dataset:")
print(df)

## 📈 Step 4: Statistical Summary using `.describe()`

For every numeric column, this method provides:
- Count, mean, and standard deviation
- Min, 25%, 50%, 75%, and max values


In [None]:
# Statistical summary of numeric columns
print("\n📊 Statistical Summary (describe):")
print(df.describe())
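
The result of `describe()` is itself a DataFrame, so individual statistics can be read out by row and column label. A short sketch using the same employee data (the variable names `employees` and `summary` are chosen here for illustration):

```python
import pandas as pd

employees = pd.DataFrame({
    "Name": ["Ram", "Shyam", "Ghanshyam", "Dhanshyam", "Aditi", "Jagdish", "Raj", "Simran"],
    "Age": [28, 32, 47, 57, 17, 27, 77, 25],
    "Salary": [5000, 6000, 45000, 5200, 4900, 7000, 9000, 17000],
    "Performance Score": [43, 71, 26, 59, 84, 38, 67, 22],
})

summary = employees.describe()

# Non-numeric columns such as Name are left out by default.
print(list(summary.columns))

# Rows are the statistic names; columns are the original column names.
mean_salary = summary.loc["mean", "Salary"]
median_salary = summary.loc["50%", "Salary"]
print(f"Salary mean={mean_salary}, median={median_salary}")
```

Here the mean salary is far above the median, which suggests that a single large value (45000) is pulling the average up.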


# The rest of the work continues in the next notebook
