In Python, a DataFrame is one of the most important data structures provided by the pandas library. Think of it as a two-dimensional, tabular data structure similar to an Excel spreadsheet or a SQL table. It is used for organizing, manipulating, analyzing, and visualizing structured data.

Here's what a pandas DataFrame does and how it works:

Key Features:

Organizes Data in Rows and Columns:
Rows represent individual records or observations.
Columns represent features or attributes of the data.

Handles Heterogeneous Data:
Different columns in a DataFrame can store different types of data (e.g., integers, floats, strings, booleans).

Flexible Indexing:
Every row has an index label and every column has a name, which makes data selection and alignment easy.

Built-in Data Operations:
Supports data manipulation such as filtering, sorting, aggregating, pivoting, merging, and reshaping.
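
The features above can be seen on a small frame. This is a minimal sketch, assuming made-up sample data; the names people, height_m, and is_member are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical sample data: one column per data type mentioned above.
people = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],  # strings
        "age": [25, 30, 35],                  # integers
        "height_m": [1.65, 1.80, 1.75],       # floats
        "is_member": [True, False, True],     # booleans
    },
    index=["a", "b", "c"],  # custom row labels instead of the default 0, 1, 2
)

print(people.dtypes)            # one data type per column
print(people.index.tolist())    # row labels
print(people.columns.tolist())  # column names

# Label-based selection: the row labelled "b", column "age".
age_of_b = people.loc["b", "age"]
print(age_of_b)  # 30
```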

In [None]:
import pandas as pd

data = {
    "calories" : [420, 3880, 390],
    "duration" : [50, 40, 45]
}

df = pd.DataFrame(data)
print(df)

   calories  duration
0       420        50
1      3880        40
2       390        45


Filtering Example

In [None]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)
filtered_df = df[df['Age'] > 28]
print(filtered_df)



      Name  Age           City
1      Bob   30  San Francisco
2  Charlie   35        Chicago


Selecting a specific column

In [None]:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Chicago']
}
df = pd.DataFrame(data)
names = df['Name']  # selecting a single column returns a Series
print(names)



0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object


Add a new column

In [None]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Chicago']
}
df = pd.DataFrame(data)
df['Country'] = ['USA', 'USA', 'USA']
print(df)



      Name  Age           City Country
0    Alice   25       New York     USA
1      Bob   30  San Francisco     USA
2  Charlie   35        Chicago     USA


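A Python dictionary cannot hold two entries with the same key. If the same column name appears more than once in the dictionary passed to pd.DataFrame, the last value silently replaces the earlier ones before pandas ever sees the data. Duplicate column names are not good practice, but if you do want them, you have to assign them explicitly after the DataFrame is created.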

In [None]:
import pandas as pd

data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'A': [7, 8, 9]  # Duplicate column name 'A'
}

df = pd.DataFrame(data)
# To create duplicate column names on purpose, assign them after construction.
# The list needs one name per existing column, e.g. df.columns = ['A', 'A']

print(df)


   A  B
0  7  4
1  8  5
2  9  6
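
If duplicate column names are really wanted, one way is to assign them after construction. The sketch below assumes a hypothetical three-column frame (df_unique and df_dup are illustrative names) and shows a side effect worth knowing: selecting a duplicated name returns every matching column.

```python
import pandas as pd

# Hypothetical frame that starts with three unique column names.
df_unique = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Renaming after construction is one way to end up with duplicate names.
df_dup = df_unique.copy()
df_dup.columns = ["A", "A", "C"]
print(df_dup)

# With duplicate names, selecting "A" returns both matching columns
# as a DataFrame rather than a single Series.
both_a = df_dup["A"]
print(type(both_a).__name__, both_a.shape)  # DataFrame (3, 2)
```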


Importing a file with a dataset

In [2]:
from google.colab import files
uploaded = files.upload()  # Opens a file browser to upload the CSV


Saving test.csv to test.csv


In [None]:
import pandas as pd

df = pd.read_csv('test.csv')
print(df)


       id  battery_power  blue  clock_speed  dual_sim  fc  four_g  int_memory  \
0       1           1043     1          1.8         1  14       0           5   
1       2            841     1          0.5         1   4       1          61   
2       3           1807     1          2.8         0   1       0          27   
3       4           1546     0          0.5         1  18       1          25   
4       5           1434     0          1.4         0  11       1          49   
..    ...            ...   ...          ...       ...  ..     ...         ...   
995   996           1700     1          1.9         0   0       1          54   
996   997            609     0          1.8         1   0       0          13   
997   998           1185     0          1.4         0   1       1           8   
998   999           1533     1          0.5         1   0       0          50   
999  1000           1270     1          0.5         0   4       1          35   

     m_dep  mobile_wt  ... 

Handle missing or duplicate rows

In [None]:
import pandas as pd

df = pd.read_csv('test.csv')

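# Both methods return a new DataFrame, so the result must be assigned.
df = df.dropna()           # drop rows that contain missing values
df = df.drop_duplicates()  # drop rows that repeat an earlier row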
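
Before dropping anything, it often helps to count what would be removed. The following is a minimal sketch on hypothetical data (the phones frame and its values are made up), not on test.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one repeated row.
phones = pd.DataFrame(
    {
        "model": ["X1", "X2", "X2", "X3"],
        "ram": [2048.0, 4096.0, 4096.0, np.nan],
    }
)

print(phones.isna().sum())        # missing values per column
print(phones.duplicated().sum())  # rows that repeat an earlier row

# dropna() and drop_duplicates() return new DataFrames;
# phones itself is unchanged unless the result is assigned back.
cleaned = phones.dropna().drop_duplicates()
print(cleaned)
```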

The .head() method in pandas returns the first few rows of a DataFrame. By default, it returns the first 5 rows, but you can pass a number to choose how many rows to return.

In [None]:
import pandas as pd

df = pd.read_csv('test.csv')
first_rows = df.head()  # returns the first 5 rows by default
print(first_rows)

# first_ten = df.head(10)  # pass a number to get that many rows
# print(first_ten)

   id  battery_power  blue  clock_speed  dual_sim  fc  four_g  int_memory  \
0   1           1043     1          1.8         1  14       0           5   
1   2            841     1          0.5         1   4       1          61   
2   3           1807     1          2.8         0   1       0          27   
3   4           1546     0          0.5         1  18       1          25   
4   5           1434     0          1.4         0  11       1          49   

   m_dep  mobile_wt  ...  pc  px_height  px_width   ram  sc_h  sc_w  \
0    0.1        193  ...  16        226      1412  3476    12     7   
1    0.8        191  ...  12        746       857  3895     6     0   
2    0.9        186  ...   4       1270      1366  2396    17    10   
3    0.5         96  ...  20        295      1752  3893    10     0   
4    0.5        108  ...  18        749       810  1773    15     8   

   talk_time  three_g  touch_screen  wifi  
0          2        0             1     0  
1          7        1 

The df.info() method in pandas prints a concise summary of a DataFrame. It is a quick way to get an overview of the structure and contents of your data.

Key Features of df.info():

Column Names:
Displays all column names in the DataFrame.

Data Types:
Shows the data type of each column (e.g., int64, float64, object).

Non-Null Values:
Indicates how many non-null (non-missing) values each column contains.

Memory Usage:
Reports the approximate memory usage of the DataFrame.
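
Each item in that summary can also be obtained separately. This is a minimal sketch, assuming a hypothetical two-column frame called sample with one missing value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
sample = pd.DataFrame(
    {
        "battery": [1043.0, 841.0, np.nan],
        "brand": ["A", None, "C"],
    }
)

print(sample.columns.tolist())      # column names
print(sample.dtypes)                # data type of each column
non_null = sample.notna().sum()     # non-null count per column
print(non_null)
print(sample.memory_usage().sum())  # approximate memory usage in bytes
```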

In [3]:
import pandas as pd

df = pd.read_csv('test.csv')
df.info()  # prints the summary itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1000 non-null   int64  
 1   battery_power  1000 non-null   int64  
 2   blue           1000 non-null   int64  
 3   clock_speed    1000 non-null   float64
 4   dual_sim       1000 non-null   int64  
 5   fc             1000 non-null   int64  
 6   four_g         1000 non-null   int64  
 7   int_memory     1000 non-null   int64  
 8   m_dep          1000 non-null   float64
 9   mobile_wt      1000 non-null   int64  
 10  n_cores        1000 non-null   int64  
 11  pc             1000 non-null   int64  
 12  px_height      1000 non-null   int64  
 13  px_width       1000 non-null   int64  
 14  ram            1000 non-null   int64  
 15  sc_h           1000 non-null   int64  
 16  sc_w           1000 non-null   int64  
 17  talk_time      1000 non-null   int64  
 18  three_g  

In [4]:
import pandas as pd

df = pd.read_csv('test.csv')
print(len(df))

1000


The shape attribute returns a tuple where:

The first element is the number of rows in the DataFrame.
The second element is the number of columns in the DataFrame.

In [6]:
import pandas as pd

df = pd.read_csv('test.csv')
print(df.shape[0])
print(df.shape[1])

1000
21
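
Because shape is a plain tuple, it can be unpacked in one step. A minimal sketch, assuming a hypothetical 3x2 frame named small:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
small = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

n_rows, n_cols = small.shape  # shape is an attribute, so no parentheses
print(n_rows, n_cols)  # 3 2

# len() on a DataFrame gives the row count, the same as shape[0].
print(len(small) == n_rows)  # True
```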



The df.describe() function in pandas is used to generate summary statistics for the numerical columns in a DataFrame. It provides a quick overview of the data's distribution and is especially helpful in exploratory data analysis (EDA).

Key Features of df.describe():

Summary Statistics:
It calculates common statistical metrics for numerical columns:
Count: Number of non-null values.
Mean: Average value.
Standard deviation (std): Measure of variation or spread.
Min: Minimum value.
25%, 50% (Median), 75%: Percentiles.
Max: Maximum value.

Applies to Numerical Data by Default:
By default, it summarizes only numeric columns (e.g., integers, floats).

Can Include All Data:
You can specify include='all' to describe both numeric and non-numeric (e.g., object, categorical) columns.
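
The difference between the default and include='all' is easiest to see on a frame that mixes numeric and text columns. The sketch below uses hypothetical data (the mixed frame and its values are made up).

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
mixed = pd.DataFrame(
    {
        "ram": [2048, 4096, 4096, 8192],
        "brand": ["A", "B", "B", "C"],
    }
)

numeric_only = mixed.describe()             # default: numeric columns only
everything = mixed.describe(include="all")  # adds unique, top, freq for text columns

print(numeric_only)
print(everything)
```

In the second result, statistics that do not apply to a column (for example the mean of brand) show up as NaN.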

In [7]:
df.describe()

,id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,...,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,500.5,1248.51,0.516,1.5409,0.517,4.593,0.487,33.652,0.5175,139.511,...,10.054,627.121,1239.774,2138.998,11.995,5.316,11.085,0.756,0.5,0.507
std,288.819436,432.458227,0.499994,0.829268,0.499961,4.463325,0.500081,18.128694,0.280861,34.85155,...,6.095099,432.929699,439.670981,1088.092278,4.320607,4.240062,5.497636,0.429708,0.50025,0.500201
min,1.0,500.0,0.0,0.5,0.0,0.0,0.0,2.0,0.1,80.0,...,0.0,0.0,501.0,263.0,5.0,0.0,2.0,0.0,0.0,0.0
25%,250.75,895.0,0.0,0.7,0.0,1.0,0.0,18.0,0.3,109.75,...,5.0,263.75,831.75,1237.25,8.0,2.0,6.75,1.0,0.0,0.0
50%,500.5,1246.5,1.0,1.5,1.0,3.0,0.0,34.5,0.5,139.0,...,10.0,564.5,1250.0,2153.5,12.0,5.0,11.0,1.0,0.5,1.0
75%,750.25,1629.25,1.0,2.3,1.0,7.0,1.0,49.0,0.8,170.0,...,16.0,903.0,1637.75,3065.5,16.0,8.0,16.0,1.0,1.0,1.0
max,1000.0,1999.0,1.0,3.0,1.0,19.0,1.0,64.0,1.0,200.0,...,20.0,1907.0,1998.0,3989.0,19.0,18.0,20.0,1.0,1.0,1.0


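The value_counts() method in pandas counts occurrences of unique values. Called on a single column (a Series), it returns a pandas Series where:

The index represents the unique values.
The values represent the frequency of each unique value.

Called on a whole DataFrame, as in the cell below, it counts unique rows instead. Because every row here has a distinct id, each row appears exactly once and every count is 1.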
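
As a minimal sketch of counting values in a single column, assuming a hypothetical frame named flags with a wifi column of 0/1 values:

```python
import pandas as pd

# Hypothetical column of 0/1 flags.
flags = pd.DataFrame({"wifi": [1, 0, 1, 1, 0]})

# Series indexed by unique value, holding how often each one occurs.
counts = flags["wifi"].value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts.
shares = flags["wifi"].value_counts(normalize=True)
print(shares)
```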

In [8]:
df.value_counts()

id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,count
1,1043,1,1.8,1,14,0,5,0.1,193,3,16,226,1412,3476,12,7,2,0,1,0,1
672,1544,0,1.0,1,2,1,64,0.3,193,1,6,595,675,1715,8,0,16,1,1,1,1
659,1281,0,3.0,1,2,1,6,0.6,191,2,9,363,811,2713,13,7,4,1,1,0,1
660,1208,0,2.7,0,0,0,17,0.2,168,2,1,1686,1793,2165,9,1,19,1,1,1,1
661,1023,1,2.8,1,16,0,44,0.7,176,4,17,629,1158,1830,14,11,6,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,1709,0,2.2,1,1,1,13,0.6,191,6,6,74,516,3899,16,13,3,1,0,1,1
340,1690,0,1.3,1,0,0,28,0.9,85,3,1,1304,1332,3571,18,2,6,1,0,1,1
341,600,0,2.5,1,3,0,9,0.2,145,4,10,584,714,2800,17,6,5,0,1,0,1
342,743,0,0.5,0,9,1,53,0.3,149,2,10,670,1070,2938,9,3,13,1,1,1,1


Filtering

In [10]:
# Filter rows where battery_power is greater than 1500
filtered_df = df[df['battery_power'] > 1500]

print(filtered_df)


      id  battery_power  blue  clock_speed  dual_sim  fc  four_g  int_memory  \
2      3           1807     1          2.8         0   1       0          27   
3      4           1546     0          0.5         1  18       1          25   
6      7           1718     0          2.4         0   1       0          47   
9     10           1520     0          0.5         0   1       0          25   
15    16           1846     1          1.0         0   5       1          53   
..   ...            ...   ...          ...       ...  ..     ...         ...   
990  991           1807     0          1.2         0   4       0          37   
991  992           1797     1          2.6         0   4       0          42   
992  993           1895     0          0.5         1   0       1          62   
995  996           1700     1          1.9         0   0       1          54   
998  999           1533     1          0.5         1   0       0          50   

     m_dep  mobile_wt  ...  pc  px_heig

In [None]:
# Filter rows where blue == 1 (has Bluetooth) and four_g == 1 (supports 4G)
filtered_df = df[(df['blue'] == 1) & (df['four_g'] == 1)]

print(filtered_df)
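
Conditions are combined with & (and), | (or), and ~ (not). Each comparison needs its own parentheses because these operators bind more tightly than == or >. The sketch below assumes a hypothetical four-row frame that reuses the blue and four_g column names; storing each mask in a variable first is one way to keep the expression readable.

```python
import pandas as pd

# Hypothetical subset of the columns used above.
phones = pd.DataFrame(
    {
        "blue": [1, 0, 1, 0],
        "four_g": [1, 1, 0, 0],
    }
)

has_blue = phones["blue"] == 1    # boolean mask, one value per row
has_4g = phones["four_g"] == 1

both = phones[has_blue & has_4g]    # AND: both conditions hold
either = phones[has_blue | has_4g]  # OR: at least one condition holds
no_blue = phones[~has_blue]         # NOT: invert a mask

print(len(both), len(either), len(no_blue))  # 1 3 2
```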


In [11]:
# Number of groups, i.e. distinct values in touch_screen (0 and 1)
num_touchscreen_groups = len(df.groupby('touch_screen'))
print(num_touchscreen_groups)

2


In [13]:
# Per touch_screen group: mean and max of battery_power, mean and min of ram
stats = df.groupby('touch_screen').aggregate({
    'battery_power': ['mean', 'max'],
    'ram': ['mean', 'min']
})
print(stats)


             battery_power             ram     
                      mean   max      mean  min
touch_screen                                   
0                 1252.892  1999  2186.474  263
1                 1244.128  1996  2091.522  265
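
The dictionary form above produces two-level column headers. If flat column names are preferred, pandas also supports named aggregation. This is a minimal sketch on hypothetical data that reuses the same column names; the output names battery_mean, battery_max, and ram_min are made up.

```python
import pandas as pd

# Hypothetical data reusing the column names from the cell above.
phones = pd.DataFrame(
    {
        "touch_screen": [0, 0, 1, 1],
        "battery_power": [1000, 2000, 1500, 1700],
        "ram": [512, 1024, 2048, 4096],
    }
)

# Named aggregation: output_name=(column, function).
flat_stats = phones.groupby("touch_screen").agg(
    battery_mean=("battery_power", "mean"),
    battery_max=("battery_power", "max"),
    ram_min=("ram", "min"),
)
print(flat_stats)
```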
