Import pandas

In [1]:
import pandas as pd

Creating a DataFrame from a .csv File

In [2]:
students_df = pd.read_csv('./students.csv')

Print the first 6 rows of the DataFrame

In [3]:
print(students_df.head(6))  # Shows the first 6 rows

   StudentID     Name  Age  Grade
0        101    Alice   16     10
1        102      Bob   17     11
2        103  Charlie   16     10
3        104    David   17     11
4        105      Eva   16     10
5        106    Frank   16     10


Print the last 3 rows of the DataFrame

In [4]:
print(students_df.tail(3))  # Shows the last 3 rows

   StudentID    Name  Age  Grade
7        108  Hannah   16     10
8        109     Ivy   17     11
9        110    Jack   16     10


Checking the Shape of a DataFrame

The next important thing we’ll check is the shape of our DataFrame. Datasets come in all shapes and sizes, and we won’t always know what to expect. Checking the shape gives us a quick way to see if we’re working with 300 columns and 2 million rows or something much smaller.

Finding the shape of our data

In [5]:
print(students_df.shape)  # (rows, columns)

(10, 4)
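Because .shape is a plain tuple of (rows, columns), it can be unpacked into two variables. A minimal sketch, using a small hypothetical DataFrame (sample_df is not part of this notebook) so it runs without students.csv:

```python
import pandas as pd

# Hypothetical stand-in data; the notebook itself loads students.csv
sample_df = pd.DataFrame({
    "StudentID": [101, 102, 103],
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [16, 17, 16],
    "Grade": [10, 11, 10],
})

# .shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking is handy when only one of the two numbers matters, for example when logging the row count after a filter.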


Checking Column Names

Another important step is looking at our column names. Datasets don’t always have clear or consistent column names, so printing them out helps us see what we’re working with. This way, we can spot any weird names, typos, or extra spaces that might cause issues later.

Print the column names

In [6]:
print(students_df.columns)

Index(['StudentID', 'Name', 'Age', 'Grade'], dtype='object')
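Stray spaces in column names are easy to miss in printed output. The sketch below shows one way to expose and remove them; messy_df and its headers are made up for illustration and do not come from students.csv:

```python
import pandas as pd

# Hypothetical frame whose headers carry stray leading/trailing spaces
messy_df = pd.DataFrame({
    " StudentID": [101, 102],
    "Name ": ["Alice", "Bob"],
    "Age": [16, 17],
})

# repr() puts quotes around each name, so the extra spaces become visible
print([repr(col) for col in messy_df.columns])

# Strip surrounding whitespace from every column name
messy_df.columns = messy_df.columns.str.strip()
print(list(messy_df.columns))
```

After the cleanup, messy_df['StudentID'] works as expected instead of raising a KeyError.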


Finding Null Values

Real-world data is messy, and missing values are pretty common. We can check for null values to see if we’re missing important information. This helps us decide if we need to clean the data, fill in gaps, or drop certain rows or columns.

In [7]:
# Check for null values in the entire DataFrame
print(students_df.isnull().sum())

StudentID    0
Name         0
Age          0
Grade        0
dtype: int64


In [8]:
# Alternatively, check for null values in specific columns
print(students_df['StudentID'].isnull().sum())
print(students_df['Name'].isnull().sum())
print(students_df['Age'].isnull().sum())
print(students_df['Grade'].isnull().sum())

0
0
0
0
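The students data has no gaps, so the fill-or-drop decision never comes up here. As a rough sketch of what the two options look like, the example below uses a hypothetical DataFrame (gappy_df) with missing values; filling with the median is just one possible choice, not a recommendation for every dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; the students.csv data above has none
gappy_df = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Age": [16, np.nan, 17],
})

# Count the nulls per column first
print(gappy_df.isnull().sum())

# Option 1: drop every row that contains at least one null
dropped_df = gappy_df.dropna()

# Option 2: keep all rows and fill missing ages with the column median
filled_df = gappy_df.copy()
filled_df["Age"] = filled_df["Age"].fillna(filled_df["Age"].median())
print(filled_df)
```

Dropping is simplest but discards data; filling keeps every row at the cost of inventing a plausible value.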


In [9]:
# Print the Data types
print(students_df.dtypes)

StudentID     int64
Name         object
Age           int64
Grade         int64
dtype: object


In [13]:
# Convert Grade to a string so it is treated as a label, not a quantity
students_df['Grade'] = students_df['Grade'].astype(str)

In [14]:
# Print the Data types
print(students_df.dtypes)

StudentID     int64
Name         object
Age           int64
Grade        object
dtype: object
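Storing Grade as a string tells pandas to treat it as a label rather than a number. The sketch below illustrates the round trip on a hypothetical Series (the names grades, grade_labels, and grades_again are made up for this example):

```python
import pandas as pd

# Hypothetical grade levels stored as integers
grades = pd.Series([10, 11, 10], name="Grade")

# As strings, the values act as labels and no longer count as numeric
grade_labels = grades.astype(str)
print(grade_labels.tolist())
print(pd.api.types.is_numeric_dtype(grade_labels))

# Convert back to integers if arithmetic is needed again
grades_again = grade_labels.astype(int)
print(grades_again.sum())
```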


The .describe() method in pandas provides a quick statistical summary of your numerical columns, offering insights into the distribution and spread of your data. It is an essential step in Exploratory Data Analysis (EDA) because it helps you:

#️⃣ Understand Data Distribution: View basic statistics like count, mean, standard deviation, min/max values, and quartiles.
#️⃣ Identify Outliers: Spot unusually high or low values by examining min/max and percentiles.
#️⃣ Check Data Consistency: Ensure values are within expected ranges.
#️⃣ Assess Missing Data: The count can reveal discrepancies between the total rows and non-null entries.

In [15]:
# Generate descriptive statistics
print(students_df.describe())

       StudentID        Age
count   10.00000  10.000000
mean   105.50000  16.400000
std      3.02765   0.516398
min    101.00000  16.000000
25%    103.25000  16.000000
50%    105.50000  16.000000
75%    107.75000  17.000000
max    110.00000  17.000000


Key Insights from .describe():

#️⃣ Central Tendency: The mean shows the average value of each column.
#️⃣ Spread of Data: The std (standard deviation) indicates how spread out the values are.
#️⃣ Minimum and Maximum: The min and max values can help identify potential outliers.
#️⃣ Percentiles: The 25%, 50% (median), and 75% percentiles provide insights into the data distribution.
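To make the outlier point concrete, the quartiles returned by .describe() can feed a simple range check. This is a sketch under assumptions: the ages Series is hypothetical, and the 1.5 × IQR cutoff is a common rule of thumb rather than anything pandas applies on its own:

```python
import pandas as pd

# Hypothetical ages; 61 is a deliberately implausible entry
ages = pd.Series([16, 17, 16, 17, 16, 16, 17, 61], name="Age")

# .describe() returns a Series indexed by statistic name
summary = ages.describe()
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1

# Rule of thumb (an assumption, not a pandas default):
# flag values more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```

A flagged value is only a candidate for review; whether it is an error depends on the data.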

Benefits of Using .describe() Early in EDA:

#️⃣ Quick Snapshot: It offers a concise overview of your data without needing to manually calculate metrics.
#️⃣ Guides Further Analysis: Highlights columns that might need cleaning, transformation, or deeper investigation.
#️⃣ Supports Data Cleaning: Reveals inconsistencies, such as unexpected minimum or maximum values.
#️⃣ Informs Visualization Choices: Helps decide which plots might best represent the data.

With the include='all' parameter, .describe() generates descriptive statistics for both numerical and non-numerical (e.g., categorical, boolean, and object) columns. This provides a broader overview of the dataset, not limited to numerical summaries. Statistics that do not apply to a column's type, such as the mean of Name or the number of unique values of Age, are shown as NaN.

In [16]:
print(students_df.describe(include='all'))

        StudentID   Name        Age Grade
count    10.00000     10  10.000000    10
unique        NaN     10        NaN     2
top           NaN  Alice        NaN    10
freq          NaN      1        NaN     6
mean    105.50000    NaN  16.400000   NaN
std       3.02765    NaN   0.516398   NaN
min     101.00000    NaN  16.000000   NaN
25%     103.25000    NaN  16.000000   NaN
50%     105.50000    NaN  16.000000   NaN
75%     107.75000    NaN  17.000000   NaN
max     110.00000    NaN  17.000000   NaN
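When only the non-numerical columns are of interest, the NaN-filled rows can be avoided by excluding numeric types instead of including everything. A minimal sketch on a hypothetical DataFrame (labels_df is not part of this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical data; Grade is stored as a string, as in the cells above
labels_df = pd.DataFrame({
    "StudentID": [101, 102, 103],
    "Name": ["Alice", "Bob", "Charlie"],
    "Grade": ["10", "11", "10"],
})

# Exclude numeric columns so only count, unique, top, and freq are reported
text_summary = labels_df.describe(exclude=[np.number])
print(text_summary)
```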


Let's find the sum of the StudentID column