In [1]:
import pandas as pd
import numpy as np

#### Loading the CSV as a DataFrame

In [2]:
df = pd.read_csv('StudentsPerformance.csv')
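
`pd.read_csv` accepts a file path or any file-like object. The sketch below uses `io.StringIO` with a hypothetical three-row excerpt (not the real file) so the loading step can be tried without `StudentsPerformance.csv` on disk; the name `sample` and the excerpt itself are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for StudentsPerformance.csv
csv_text = """gender,lunch,math score,reading score,writing score
female,standard,72,72,74
female,standard,69,90,88
male,free/reduced,47,57,44
"""

# read_csv parses the header row into column names and infers dtypes
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # text columns are object, score columns are int64
```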

#### Viewing the DataFrame

In [3]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [4]:
df.tail()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77
999,female,group D,some college,free/reduced,none,77,86,86


#### Getting Column Names

In [5]:
df.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

#### Renaming Columns

In [6]:
df.rename(columns = {'gender': 'Gender', 'lunch': 'Lunch'}, inplace = True)
df

Unnamed: 0,Gender,race/ethnicity,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77
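
The cell above renames in place. As a minimal sketch on a hypothetical two-row frame (the name `sample` is an assumption), the same call without `inplace=True` returns a renamed copy and leaves the original untouched, which can be easier to reason about when cells are re-run.

```python
import pandas as pd

# Hypothetical two-row frame with the original lowercase column names
sample = pd.DataFrame({"gender": ["female", "male"],
                       "lunch": ["standard", "free/reduced"]})

# Without inplace=True, rename returns a new DataFrame
renamed = sample.rename(columns={"gender": "Gender", "lunch": "Lunch"})

print(list(sample.columns))   # original names are unchanged
print(list(renamed.columns))  # renamed copy
```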


#### Re-ordering Columns

In [7]:
column_names = ['race/ethnicity', 'Gender', 'parental level of education', 'Lunch',
                'test preparation course', 'math score', 'reading score',
                'writing score']
df = df.reindex(columns=column_names)
df

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,group B,female,bachelor's degree,standard,none,72,72,74
1,group C,female,some college,standard,completed,69,90,88
2,group B,female,master's degree,standard,none,90,95,93
3,group A,male,associate's degree,free/reduced,none,47,57,44
4,group C,male,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,group E,female,master's degree,standard,completed,88,99,95
996,group C,male,high school,free/reduced,none,62,55,55
997,group C,female,high school,free/reduced,completed,59,71,65
998,group D,female,some college,standard,completed,68,78,77
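
One thing to watch with `reindex`: a misspelled column name does not raise an error but produces a new column filled with NaN. The sketch below, on a hypothetical two-row frame, contrasts this with list-based selection, which raises `KeyError` for the same typo.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
sample = pd.DataFrame({"Gender": ["female", "male"], "math score": [72, 47]})

# reindex accepts the unknown label 'gender' (lowercase typo) and fills it with NaN
reordered = sample.reindex(columns=["math score", "gender"])
print(reordered["gender"].isna().all())

# Selecting with a list of names raises KeyError on the same typo
try:
    sample[["math score", "gender"]]
    raised = False
except KeyError:
    raised = True
print(raised)
```

When every listed name is expected to exist, `df[column_names]` may therefore be the safer way to reorder.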


#### Displaying Basic Information

In [8]:
df.shape

(1000, 8)

In [9]:
df.count()

race/ethnicity                 1000
Gender                         1000
parental level of education    1000
Lunch                          1000
test preparation course        1000
math score                     1000
reading score                  1000
writing score                  1000
dtype: int64

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   race/ethnicity               1000 non-null   object
 1   Gender                       1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   Lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [11]:
df.index

RangeIndex(start=0, stop=1000, step=1)

In [12]:
df.sum()

race/ethnicity                 group Bgroup Cgroup Bgroup Agroup Cgroup Bgrou...
Gender                         femalefemalefemalemalemalefemalefemalemalemale...
parental level of education    bachelor's degreesome collegemaster's degreeas...
Lunch                          standardstandardstandardfree/reducedstandardst...
test preparation course        nonecompletednonenonenonenonecompletednonecomp...
math score                                                                 66089
reading score                                                              69169
writing score                                                              68054
dtype: object
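
In the output above, the text columns are concatenated into long strings because `sum` on object columns joins the values. A minimal sketch, on a hypothetical three-row frame, of restricting the sum to numeric columns with `numeric_only=True`:

```python
import pandas as pd

# Hypothetical three-row frame for illustration
sample = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "math score": [72, 47, 90],
    "reading score": [72, 57, 95],
})

# numeric_only=True skips the object (text) columns
totals = sample.sum(numeric_only=True)
print(totals)
```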

In [13]:
df.isna().sum()

race/ethnicity                 0
Gender                         0
parental level of education    0
Lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

In [14]:
df.dtypes

race/ethnicity                 object
Gender                         object
parental level of education    object
Lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

In [15]:
df.describe()

Unnamed: 0,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


In [16]:
df_math_count = df['math score'].value_counts().reset_index()
df_math_count.columns = ['math score', 'count']
df_math_count

Unnamed: 0,math score,count
0,65,36
1,62,35
2,69,32
3,59,32
4,61,27
...,...,...
76,24,1
77,28,1
78,33,1
79,18,1


In [17]:
df_math_count.sort_values('math score', ascending=False, inplace=True)
df_math_count.reset_index(inplace=True, drop=True)
df_math_count

Unnamed: 0,math score,count
0,100,7
1,99,3
2,98,3
3,97,6
4,96,3
...,...,...
76,22,1
77,19,1
78,18,1
79,8,1
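
The two cells above count each score and then sort by score. As a sketch of an equivalent single chain (using a hypothetical six-value Series, not the real column), `sort_index` orders the counts by the score itself, and `rename_axis` plus `reset_index(name=...)` sets both column names explicitly.

```python
import pandas as pd

# Hypothetical scores for illustration
scores = pd.Series([65, 62, 65, 100, 62, 65], name="math score")

counts_df = (
    scores.value_counts()            # frequency of each distinct score
    .sort_index(ascending=False)     # order by score, highest first
    .rename_axis("math score")       # name the index so it becomes a column
    .reset_index(name="count")       # move scores into a column, name the counts
)
print(counts_df)
```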


#### Sorting based on Column Values

In [18]:
df.sort_values('math score', ascending=False)

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
962,group E,female,associate's degree,standard,none,100,100,100
625,group D,male,some college,standard,completed,100,97,99
458,group E,female,bachelor's degree,standard,none,100,100,100
623,group A,male,some college,standard,completed,100,96,86
451,group E,female,some college,standard,none,100,92,97
...,...,...,...,...,...,...,...,...
145,group C,female,some college,free/reduced,none,22,39,33
787,group B,female,some college,standard,none,19,38,32
17,group B,female,some high school,free/reduced,none,18,32,28
980,group B,female,high school,free/reduced,none,8,24,23


#### Selecting Rows by Position

In [19]:
df_20_100 = df[20:100]

In [20]:
df_20_100

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
20,group D,male,high school,standard,none,66,69,63
21,group B,female,some college,free/reduced,completed,65,75,70
22,group D,male,some college,standard,none,44,54,53
23,group C,female,some high school,standard,none,69,73,73
24,group D,male,bachelor's degree,free/reduced,completed,74,71,80
...,...,...,...,...,...,...,...,...
95,group C,male,associate's degree,free/reduced,completed,78,81,82
96,group B,male,some high school,standard,completed,65,66,62
97,group E,female,some college,standard,completed,63,72,70
98,group D,female,some college,free/reduced,none,58,67,62
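
Slicing with `df[20:100]` selects by position and excludes the stop value, so it returns rows at positions 20 through 99. The sketch below, on a hypothetical five-row frame whose index labels are deliberately out of order, shows that `iloc` states the same positional intent explicitly.

```python
import pandas as pd

# Hypothetical five-row frame; sorting leaves the index labels out of order
sample = pd.DataFrame({"math score": [72, 69, 90, 47, 76]}).sort_values("math score")

# Both forms select rows at positions 1 and 2 (the stop position 3 is excluded)
by_slice = sample[1:3]
by_iloc = sample.iloc[1:3]

print(by_iloc.index.tolist())    # index labels of the selected rows
print(by_slice.equals(by_iloc))  # the two selections are identical
```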


#### Dropping Columns

In [21]:
df_drop = df.drop(['Gender', 'parental level of education', 'test preparation course'], axis=1)
df_drop

Unnamed: 0,race/ethnicity,Lunch,math score,reading score,writing score
0,group B,standard,72,72,74
1,group C,standard,69,90,88
2,group B,standard,90,95,93
3,group A,free/reduced,47,57,44
4,group C,standard,76,78,75
...,...,...,...,...,...
995,group E,standard,88,99,95
996,group C,free/reduced,62,55,55
997,group C,free/reduced,59,71,65
998,group D,standard,68,78,77


#### Selecting Desired Columns

In [22]:
df_gen_lun = df[['Gender', 'Lunch']]
df_gen_lun

Unnamed: 0,Gender,Lunch
0,female,standard
1,female,standard
2,female,standard
3,male,free/reduced
4,male,standard
...,...,...
995,female,standard
996,male,free/reduced
997,female,free/reduced
998,female,standard


#### Selecting Based on Condition

In [23]:
dfmath = df[df['math score'] > 50]
dfmath.head(20)

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,group B,female,bachelor's degree,standard,none,72,72,74
1,group C,female,some college,standard,completed,69,90,88
2,group B,female,master's degree,standard,none,90,95,93
4,group C,male,some college,standard,none,76,78,75
5,group B,female,associate's degree,standard,none,71,83,78
6,group B,female,some college,standard,completed,88,95,92
8,group D,male,high school,free/reduced,completed,64,64,67
10,group C,male,associate's degree,standard,none,58,54,52
12,group B,female,high school,standard,none,65,81,73
13,group A,male,some college,standard,completed,78,72,70


In [24]:
dfmath.reset_index(inplace = True, drop=True)
dfmath.head(20)

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,group B,female,bachelor's degree,standard,none,72,72,74
1,group C,female,some college,standard,completed,69,90,88
2,group B,female,master's degree,standard,none,90,95,93
3,group C,male,some college,standard,none,76,78,75
4,group B,female,associate's degree,standard,none,71,83,78
5,group B,female,some college,standard,completed,88,95,92
6,group D,male,high school,free/reduced,completed,64,64,67
7,group C,male,associate's degree,standard,none,58,54,52
8,group B,female,high school,standard,none,65,81,73
9,group A,male,some college,standard,completed,78,72,70


In [25]:
dfmath_reading= df[(df['math score'] > 50) & (df['reading score'] > 70)]
dfmath_reading.reset_index(inplace = True, drop=True)
dfmath_reading.head(20)

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,group B,female,bachelor's degree,standard,none,72,72,74
1,group C,female,some college,standard,completed,69,90,88
2,group B,female,master's degree,standard,none,90,95,93
3,group C,male,some college,standard,none,76,78,75
4,group B,female,associate's degree,standard,none,71,83,78
5,group B,female,some college,standard,completed,88,95,92
6,group B,female,high school,standard,none,65,81,73
7,group A,male,some college,standard,completed,78,72,70
8,group C,female,some high school,standard,none,69,75,78
9,group C,male,high school,standard,none,88,89,86


In [26]:
dfmath_reading= df[(df['math score'] > 50) | (df['reading score'] > 70)]
dfmath_reading.reset_index(inplace = True, drop=True)
dfmath_reading.head(20)


Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,group B,female,bachelor's degree,standard,none,72,72,74
1,group C,female,some college,standard,completed,69,90,88
2,group B,female,master's degree,standard,none,90,95,93
3,group C,male,some college,standard,none,76,78,75
4,group B,female,associate's degree,standard,none,71,83,78
5,group B,female,some college,standard,completed,88,95,92
6,group D,male,high school,free/reduced,completed,64,64,67
7,group C,male,associate's degree,standard,none,58,54,52
8,group B,female,high school,standard,none,65,81,73
9,group A,male,some college,standard,completed,78,72,70
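
The two cells above differ only in `&` (both conditions must hold) versus `|` (at least one must hold). A minimal sketch on a hypothetical four-row frame, with the masks given names (`high_math` and `high_reading` are assumptions for illustration) so the difference is easy to see:

```python
import pandas as pd

# Hypothetical four-row frame for illustration
sample = pd.DataFrame({
    "math score": [72, 47, 40, 76],
    "reading score": [72, 75, 43, 60],
})

# Each comparison yields a boolean Series, one value per row
high_math = sample["math score"] > 50
high_reading = sample["reading score"] > 70

both = sample[high_math & high_reading]    # rows passing both conditions
either = sample[high_math | high_reading]  # rows passing at least one

print(len(both), len(either))
```

When the comparisons are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `>`.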


In [27]:
dfmath_72 = df.loc[df['math score'] == 72, ['reading score', 'writing score']]
dfmath_72

Unnamed: 0,reading score,writing score
0,72,74
83,64,63
126,68,67
170,73,74
226,72,71
345,80,75
459,65,68
467,67,65
547,67,64
642,81,79


#### Pandas Groupby

In [28]:
df

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,group B,female,bachelor's degree,standard,none,72,72,74
1,group C,female,some college,standard,completed,69,90,88
2,group B,female,master's degree,standard,none,90,95,93
3,group A,male,associate's degree,free/reduced,none,47,57,44
4,group C,male,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,group E,female,master's degree,standard,completed,88,99,95
996,group C,male,high school,free/reduced,none,62,55,55
997,group C,female,high school,free/reduced,completed,59,71,65
998,group D,female,some college,standard,completed,68,78,77


In [53]:
gender_group = df.groupby('Gender')
print(type(gender_group))
print(type(df))

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<class 'pandas.core.frame.DataFrame'>


In [54]:
df_gender_group = df.groupby('Gender').agg('count')
df_gender_group

Unnamed: 0_level_0,race/ethnicity,parental level of education,Lunch,test preparation course,math score,reading score,writing score
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,518,518,518,518,518,518,518
male,482,482,482,482,482,482,482


In [55]:
# numeric_only=True averages only the score columns; recent pandas versions
# raise an error when asked for the mean of text columns
df_gender_group = df.groupby('Gender').mean(numeric_only=True)
df_gender_group


Unnamed: 0_level_0,math score,reading score,writing score
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,63.633205,72.608108,72.467181
male,68.728216,65.473029,63.311203


In [56]:
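# drop=True discards the Gender labels and leaves a plain 0/1 index;
# call reset_index() without drop=True to keep Gender as a column
df_gender_group = df_gender_group.reset_index(level=0, drop=True)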

In [57]:
df_gender_group

Unnamed: 0,math score,reading score,writing score
0,63.633205,72.608108,72.467181
1,68.728216,65.473029,63.311203
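
After `reset_index(level=0, drop=True)` the rows are labelled 0 and 1, so it is no longer visible which row is female and which is male. A minimal sketch, on a hypothetical four-row frame, of keeping the group labels as an ordinary column by omitting `drop=True`:

```python
import pandas as pd

# Hypothetical four-row frame for illustration
sample = pd.DataFrame({
    "Gender": ["female", "male", "female", "male"],
    "math score": [72, 47, 90, 76],
})

# reset_index() without drop=True turns the Gender index into a column
means = sample.groupby("Gender")[["math score"]].mean().reset_index()
print(means)
```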


In [58]:
df

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,group B,female,bachelor's degree,standard,none,72,72,74
1,group C,female,some college,standard,completed,69,90,88
2,group B,female,master's degree,standard,none,90,95,93
3,group A,male,associate's degree,free/reduced,none,47,57,44
4,group C,male,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,group E,female,master's degree,standard,completed,88,99,95
996,group C,male,high school,free/reduced,none,62,55,55
997,group C,female,high school,free/reduced,completed,59,71,65
998,group D,female,some college,standard,completed,68,78,77


In [59]:
df_gender_eth_group = df.groupby(['Gender', 'race/ethnicity', 'parental level of education',
                                  'Lunch', 'test preparation course']).agg('mean')
df_gender_eth_group

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,math score,reading score,writing score
Gender,race/ethnicity,parental level of education,Lunch,test preparation course,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,group A,associate's degree,free/reduced,none,47.666667,64.333333,60.000000
female,group A,associate's degree,standard,completed,60.000000,67.500000,68.000000
female,group A,associate's degree,standard,none,82.000000,93.000000,93.000000
female,group A,bachelor's degree,standard,none,51.666667,60.000000,61.666667
female,group A,high school,free/reduced,completed,54.666667,62.000000,62.000000
...,...,...,...,...,...,...,...
male,group E,some college,standard,completed,87.250000,79.750000,74.500000
male,group E,some college,standard,none,73.750000,67.750000,63.333333
male,group E,some high school,free/reduced,completed,75.500000,75.000000,69.500000
male,group E,some high school,standard,completed,79.333333,72.333333,70.500000
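
Grouping on five columns at once splits 1000 rows into many small groups, so some of the means above may rest on only a handful of students. The sketch below, on a hypothetical five-row frame with two grouping keys, uses named aggregation to report the group size next to the mean; the output column names `mean_math` and `n` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical five-row frame for illustration
sample = pd.DataFrame({
    "Gender": ["female", "female", "male", "male", "female"],
    "Lunch": ["standard", "standard", "free/reduced", "standard", "free/reduced"],
    "math score": [72, 90, 47, 76, 60],
})

# Named aggregation: output_name=(input_column, function)
summary = sample.groupby(["Gender", "Lunch"]).agg(
    mean_math=("math score", "mean"),
    n=("math score", "size"),
)
print(summary)
```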


#### One Hot Encoding

In [60]:
df_OHE = pd.get_dummies(data=df, columns=['Gender', 'race/ethnicity', 'parental level of education', 'Lunch', 'test preparation course'])

In [61]:
df_OHE

Unnamed: 0,math score,reading score,writing score,Gender_female,Gender_male,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_associate's degree,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,parental level of education_some high school,Lunch_free/reduced,Lunch_standard,test preparation course_completed,test preparation course_none
0,72,72,74,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1
1,69,90,88,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0
2,90,95,93,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1
3,47,57,44,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1
4,76,78,75,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,88,99,95,1,0,0,0,0,0,1,0,0,0,1,0,0,0,1,1,0
996,62,55,55,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1
997,59,71,65,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0
998,68,78,77,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,1,0
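
The output above shows the indicator columns as 1 and 0. From pandas 2.0 onward, `get_dummies` returns boolean columns by default, so reproducing integer indicators may require `dtype=int`. The sketch below, on a hypothetical three-row frame, also shows `drop_first=True`, which drops one level per variable to avoid redundant columns.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
sample = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "math score": [72, 47, 90],
})

# dtype=int forces 1/0 indicators regardless of pandas version
encoded = pd.get_dummies(sample, columns=["Gender"], dtype=int)
print(encoded)

# drop_first=True keeps only Gender_male; Gender_female is implied when it is 0
compact = pd.get_dummies(sample, columns=["Gender"], dtype=int, drop_first=True)
print(compact)
```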


In [62]:
df.head(3)

Unnamed: 0,race/ethnicity,Gender,parental level of education,Lunch,test preparation course,math score,reading score,writing score
0,group B,female,bachelor's degree,standard,none,72,72,74
1,group C,female,some college,standard,completed,69,90,88
2,group B,female,master's degree,standard,none,90,95,93


#### Take-away

Data mining, the practice of identifying patterns and extracting pertinent information from large datasets, is now a core skill for data scientists. In this practical exercise, we learned how to load a CSV file as a DataFrame and carry out fundamental operations on it. This section summarizes what we took away from the exercise.

The main lesson is how to load a CSV file as a DataFrame. A DataFrame is a tabular, two-dimensional data structure in which rows correspond to cases and columns to variables. Because CSV is a widely used format for storing and exchanging data, loading it into a DataFrame is a basic skill for anyone studying data science. The second lesson is how to view the first and last rows of a DataFrame, which gives a quick overview of the data and is especially useful with large datasets.

We also learned how to inspect, rename, and reorder columns, which helps when we want to focus on particular variables or study them in a particular order. The built-in pandas methods make it simple to obtain basic information such as the number of rows and columns, the data type of each variable, and summary statistics. Finally, we saw that a DataFrame loaded from a CSV file can be sorted, filtered, grouped, and encoded to surface patterns and relationships in the data. Such insights support informed decisions in fields including marketing, finance, and healthcare.

In conclusion, this exercise taught us how to load a CSV file as a DataFrame, view its first and last rows, inspect and rearrange its columns, display basic information, and extract practical insights from the data. These skills are essential for anyone working with data, particularly for data science students like us.
