<a href="https://colab.research.google.com/github/DevDevarakonda/COB/blob/main/experiment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Importing Datasets: Learning Objectives, Understanding the Domain, Understanding
# the Dataset, Python Packages for Data Science, Importing and Exporting Data in Python,
# Basic Insights from Datasets
#
# This lab-exam experiment covers every topic listed above, in order.


import pandas as pd
import numpy as np

# 1. Learning Objectives
# The goal of this experiment is to understand how to import and export datasets in Python,
# gain insights into the dataset, and explore the essential Python packages used in data science.
# This includes learning how to manipulate datasets, analyze missing values, and summarize key statistics.

# 2. Understanding the Domain
# Before working with a dataset, it's important to understand its domain.
# In this case, we are dealing with student performance data, which consists of marks obtained in different subjects.
# Understanding the significance of each subject and how marks are distributed will help in meaningful analysis.

# 3. Understanding the Dataset
# We'll use a sample dataset containing student names and their marks in different subjects.
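# The listing below holds 5 students, each with scores in Maths, Science, English, and History.
# (The recorded output further down reports 10 rows, so it comes from a longer version of this table.)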
# This data can be used to analyze student performance trends, identify strengths and weaknesses,
# and generate insights for better academic planning.

data = {
    "Student": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Maths": [85, 78, 92, 88, 76],
    "Science": [90, 74, 89, 85, 80],
    "English": [78, 82, 88, 90, 75],
    "History": [80, 85, 79, 88, 82]
}

df = pd.DataFrame(data)

# 4. Python Packages for Data Science
# The primary Python packages used for data science include:
# - Pandas: Used for data manipulation and analysis, allowing us to handle datasets efficiently.
# - NumPy: Used for fast numerical computation on arrays; Pandas is built on top of it.
# - Matplotlib/Seaborn: Used for data visualization to graphically represent insights from the data.
# In this experiment, we will focus on Pandas for dataset handling.

# 5. Importing and Exporting Data in Python
# Pandas can import data from sources such as CSV files (pd.read_csv), Excel workbooks
# (pd.read_excel), and SQL databases (pd.read_sql).
# Here, we create the dataset manually and demonstrate how to export it to a CSV file.

df.to_csv("students_marks.csv", index=False)
print("Dataset Exported Successfully")

# 6. Basic Insights from Datasets
# After importing the dataset, we can extract basic insights to understand the data better.

# Display the first 5 rows of the dataset
print("\nFirst 5 Rows of Dataset:")
print(df.head())

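# Display basic information about the dataset.
# df.info() prints its report directly and returns None, which is why a
# trailing "None" appears when the call is wrapped in print().
print("\nDataset Information:")
print(df.info())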

# Display summary statistics
print("\nSummary Statistics:")
print(df.describe())

# Checking for missing values
print("\nMissing Values in Dataset:")
print(df.isnull().sum())

# These insights help in determining if there are missing values, unusual data distributions,
# or any other issues that need to be addressed before further analysis.


Dataset Exported Successfully

First 5 Rows of Dataset:
   Student  Maths  Science  English  History
0    Alice     85       90       78       80
1      Bob     78       74       82       85
2  Charlie     92       89       88       79
3    David     88       85       90       88
4      Eve     76       80       75       82

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Student  10 non-null     object
 1   Maths    10 non-null     int64 
 2   Science  10 non-null     int64 
 3   English  10 non-null     int64 
 4   History  10 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 532.0+ bytes
None

Summary Statistics:
           Maths    Science    English    History
count  10.000000  10.000000  10.000000  10.000000
mean   85.200000  85.100000  82.400000  84.100000
std     6.338594   6.332456   4.742245   4.581363
min    76.0
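
Section 5 of the cell above exports the table with `to_csv` but never reads a file back. A minimal sketch of the import side follows, assuming an in-memory `io.StringIO` buffer as a stand-in for `students_marks.csv` so the example leaves no file behind. The variable names `buffer` and `loaded` are illustrative, and only three of the columns are repeated here to keep it short.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Student": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Maths": [85, 78, 92, 88, 76],
    "Science": [90, 74, 89, 85, 80],
})

# Export: write CSV text into the buffer (a file path would work the same way).
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Import: rewind the buffer and parse the CSV text back into a DataFrame.
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(loaded.head())
print(loaded.dtypes)
```

With a real file, `pd.read_csv("students_marks.csv")` would read what the cell above writes. Passing `index=False` on export matters for the round trip: without it the row index is written as an extra unnamed column and comes back as `Unnamed: 0`.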

In [4]:
import pandas as pd
import numpy as np

# 1. Learning Objectives
# The goal of this experiment is to understand how to clean and prepare data in Python.
# This includes identifying and handling missing values, data formatting, normalization, binning, and indicator variables.

# 2. Understanding the Domain
# Data cleaning and preparation are essential steps in data science.
# They help ensure that the dataset is accurate, consistent, and suitable for analysis.
# In this case, we will work with student performance data and apply various data cleaning techniques.

# 3. Understanding the Dataset
# The dataset consists of student names and their marks in different subjects.
# The data may contain missing values or inconsistencies that need to be handled before further analysis.

data = {
    "Student": ["Dev", "Vamsi", "Sasank", "Yash", "Vishnu"],
    "Maths": [85, np.nan, 92, 88, 76],
    "Science": [90, 74, np.nan, 85, 80],
    "English": [78, 82, 88, 90, np.nan],
    "History": [80, 85, 79, np.nan, 82]
}

df = pd.DataFrame(data)

# 4. Identifying and Handling Missing Values
# Checking for missing values
print("\nMissing Values in Dataset:")
print(df.isnull().sum())

# Filling missing values with the mean of each column
df.fillna(df.select_dtypes(include=np.number).mean(), inplace=True)
print("\nDataset after Handling Missing Values:")
print(df)

# 5. Data Formatting
# Standardize column names: strip surrounding whitespace and convert to lowercase
df.columns = df.columns.str.strip().str.lower()
print("\nFormatted Column Names:", df.columns.tolist())

# 6. Data Normalization
# Min-max scaling: rescale each subject's marks to the range [0, 1]
marks = df.iloc[:, 1:]
df.iloc[:, 1:] = (marks - marks.min()) / (marks.max() - marks.min())
print("\nDataset after Normalization:")
print(df)

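# 7. Binning
# Creating grade categories from the normalized Maths scores.
# include_lowest=True keeps the minimum score (exactly 0.0 after normalization)
# in the 'Low' bin; without it, pd.cut leaves that row as NaN.
bins = [0, 0.3, 0.7, 1.0]
labels = ['Low', 'Medium', 'High']
df['maths_grade'] = pd.cut(df['maths'], bins=bins, labels=labels, include_lowest=True)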
print("\nDataset after Binning Maths Scores:")
print(df)

# 8. Indicator Variables (One-Hot Encoding)
# Converting categorical variables into dummy variables
df = pd.get_dummies(df, columns=['maths_grade'])
print("\nDataset after Creating Indicator Variables:")
print(df)

# 9. Exporting Cleaned Data
df.to_csv("cleaned_students_marks.csv", index=False)
print("\nCleaned Dataset Exported Successfully")
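
The cleaning steps in the cell above modify `df` in place, one after another. The same sequence can be sketched as small functions that return new frames and leave the raw data untouched. This is only an illustration under stated assumptions: `fill_numeric_with_mean` and `min_max_scale` are hypothetical helper names (not pandas functions), only the Maths column is repeated, and no numeric column is assumed to be constant (a constant column would divide by zero during scaling).

```python
import numpy as np
import pandas as pd


def fill_numeric_with_mean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs in numeric columns replaced by the column mean."""
    out = frame.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every numeric column rescaled to [0, 1]."""
    out = frame.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    col_min = out[numeric_cols].min()
    col_range = out[numeric_cols].max() - col_min
    out[numeric_cols] = (out[numeric_cols] - col_min) / col_range
    return out


raw = pd.DataFrame({
    "student": ["Dev", "Vamsi", "Sasank", "Yash", "Vishnu"],
    "maths": [85, np.nan, 92, 88, 76],
})

scaled = min_max_scale(fill_numeric_with_mean(raw))

# include_lowest=True keeps the minimum (exactly 0.0) inside the first bin.
scaled["maths_grade"] = pd.cut(
    scaled["maths"],
    bins=[0, 0.3, 0.7, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# One indicator column per grade category.
encoded = pd.get_dummies(scaled, columns=["maths_grade"])

print(scaled)
print(encoded.columns.tolist())
```

Because each helper works on a copy, `raw` still contains its original missing value afterwards, which makes it easy to compare the data before and after cleaning.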


Missing Values in Dataset:
Student    0
Maths      1
Science    1
English    1
History    1
dtype: int64

Dataset after Handling Missing Values:
  Student  Maths  Science  English  History
0     Dev  85.00    90.00     78.0     80.0
1   Vamsi  85.25    74.00     82.0     85.0
2  Sasank  92.00    82.25     88.0     79.0
3    Yash  88.00    85.00     90.0     81.5
4  Vishnu  76.00    80.00     84.5     82.0

Formatted Column Names: ['student', 'maths', 'science', 'english', 'history']

Dataset after Normalization:
  student     maths   science   english   history
0     Dev  0.562500  1.000000  0.000000  0.166667
1   Vamsi  0.578125  0.000000  0.333333  1.000000
2  Sasank  1.000000  0.515625  0.833333  0.000000
3    Yash  0.750000  0.687500  1.000000  0.416667
4  Vishnu  0.000000  0.375000  0.541667  0.500000

Dataset after Binning Maths Scores:
  student     maths   science   english   history maths_grade
0     Dev  0.562500  1.000000  0.000000  0.166667      Medium
1   Vamsi  0.578125 