## Pandas: DataFrame and Series
Pandas is a powerful data manipulation library for Python, widely used for data analysis and data cleaning. It provides two primary data structures: Series and DataFrame. A Series is a one-dimensional, labeled, array-like object, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In [1]:
import pandas as pd

In [59]:
## Series
## A pandas Series is a one-dimensional array-like object that can hold any data type

data=[1,3,4,5]
series=pd.Series(data)
print("Series\n",series)
print(type(series))

Series
 0    1
1    3
2    4
3    5
dtype: int64
<class 'pandas.core.series.Series'>


In [60]:
## Create a Series from a dictionary
data={'a':1,'b':2,'c':3}
series=pd.Series(data)
print(series)
print(type(series))

a    1
b    2
c    3
dtype: int64
<class 'pandas.core.series.Series'>


In [61]:
data=[10,20,30]
index=['a','b','c']

series=pd.Series(data,index=index)
print(series)

a    10
b    20
c    30
dtype: int64
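
Once a Series has custom labels, its values can be read either by label or by integer position. The following is a minimal sketch of both styles, reusing the data and labels from the cell above; the variable names `by_label`, `by_label_loc`, and `by_position` are illustrative only.

```python
import pandas as pd

# Same data and labels as the cell above
series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = series['b']           # label-based lookup
by_label_loc = series.loc['b']   # explicit label-based lookup
by_position = series.iloc[1]     # position-based lookup

print(by_label, by_label_loc, by_position)
print(series.loc['a':'b'])       # label slices include the end label
print(series.iloc[0:1])          # position slices exclude the end position
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.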


In [62]:
## DataFrame 
## Create a DataFrame from a dictionary of lists

data={
    'name':['madhav','subodh','suraj','vishal'],
    'age':[23,24,56,75],
    'city':['delhi','patna','mumbai','goa']
}
df=pd.DataFrame(data)
print(df)
print(type(df))

     name  age    city
0  madhav   23   delhi
1  subodh   24   patna
2   suraj   56  mumbai
3  vishal   75     goa
<class 'pandas.core.frame.DataFrame'>


In [63]:
import numpy as np

In [64]:
## Create a DataFrame from a list of dictionaries

data=[
   {'name':'madhav','age':32,'city':'jaipur'},
    {'name':'subodh','age':43,'city':'mumbai'}
]

df=pd.DataFrame(data)
print(df)
print(type(df))

     name  age    city
0  madhav   32  jaipur
1  subodh   43  mumbai
<class 'pandas.core.frame.DataFrame'>


In [65]:
data

[{'name': 'madhav', 'age': 32, 'city': 'jaipur'},
 {'name': 'subodh', 'age': 43, 'city': 'mumbai'}]

In [67]:
df

     name  age    city
0  madhav   32  jaipur
1  subodh   43  mumbai


In [68]:
df['name']

0    madhav
1    subodh
Name: name, dtype: object

In [69]:
df.loc[0]

name    madhav
age         32
city    jaipur
Name: 0, dtype: object

In [70]:
df.loc[0].iloc[0]  # positional access on the row; df.loc[0][0] relies on deprecated positional fallback

'madhav'

In [72]:
## Accessing a specific element by row label and column name with at
df.at[0,'age']

np.int64(32)

In [17]:
## Accessing a specific element by integer position with iat
## df has only 2 rows, so valid row positions are 0 and 1
df.iat[1,2]

'mumbai'
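
The accessors used in the cells above differ in whether they take labels or integer positions, and `iat` raises an `IndexError` when a position falls outside the frame's shape. The sketch below contrasts the four accessors on the same two-row frame and shows one possible way to guard a positional lookup; the guard and the names `row`, `col`, and `value` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'madhav', 'age': 32, 'city': 'jaipur'},
    {'name': 'subodh', 'age': 43, 'city': 'mumbai'},
])

row_by_label = df.loc[0]          # row whose index label is 0 (a Series)
row_by_position = df.iloc[0]      # first row by position (a Series)
cell_by_label = df.at[0, 'age']   # single cell by row label and column name
cell_by_position = df.iat[1, 2]   # single cell by row and column position

# Guard positional access so an out-of-range position does not raise IndexError
n_rows, n_cols = df.shape
row, col = 2, 2
value = df.iat[row, col] if row < n_rows and col < n_cols else None
print(cell_by_label, cell_by_position, value)
```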

In [18]:
df

     name  age    city
0  madhav   32  jaipur
1  subodh   43  mumbai


In [73]:
## Data Manipulation with Dataframe
df

     name  age    city
0  madhav   32  jaipur
1  subodh   43  mumbai


In [75]:
df['salary']=[80000,90000]
df

     name  age    city  salary
0  madhav   32  jaipur   80000
1  subodh   43  mumbai   90000


In [76]:
## Delete a column

df.drop('salary',axis=1) # returns a new DataFrame; df itself is not changed

     name  age    city
0  madhav   32  jaipur
1  subodh   43  mumbai


In [77]:
df

     name  age    city  salary
0  madhav   32  jaipur   80000
1  subodh   43  mumbai   90000


In [78]:
# Delete the column permanently (in place)
df.drop(['salary'],axis=1,inplace=True)

In [79]:
df

     name  age    city
0  madhav   32  jaipur
1  subodh   43  mumbai
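
As the cells above show, `drop` leaves the original frame untouched unless `inplace=True` is passed. A common alternative is to reassign the result. The sketch below rebuilds a small frame with the same columns and uses the `columns=` keyword, which is equivalent to `axis=1`; the name `without_salary` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['madhav', 'subodh'], 'age': [32, 43], 'salary': [80000, 90000]})

without_salary = df.drop(columns=['salary'])   # new DataFrame; df still has 'salary'
assert 'salary' in df.columns

df = df.drop(columns=['salary'])               # reassign to make the change stick
print(df.columns.tolist())
```

Reassignment keeps each step explicit and lets the calls be chained, which is why it is often preferred over `inplace=True`.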


In [80]:
df['age']=df['age'] +3
df

     name  age    city
0  madhav   35  jaipur
1  subodh   46  mumbai


In [81]:
df.drop(0,inplace = True)

In [82]:
df

     name  age    city
1  subodh   46  mumbai


In [83]:
df.describe()

        age
count   1.0
mean   46.0
std     NaN
min    46.0
25%    46.0
50%    46.0
75%    46.0
max    46.0
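
The `std` entry is empty because pandas computes the sample standard deviation by default (`ddof=1`), which is undefined for a single row. A small sketch of the difference, using the one remaining age value; the Series name `ages` is illustrative.

```python
import pandas as pd

ages = pd.Series([46])
print(ages.std())          # sample std (ddof=1) is NaN for a single value
print(ages.std(ddof=0))    # population std (ddof=0) is 0.0
```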


In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 1 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    1 non-null      object
 1   age     1 non-null      int64 
 2   city    1 non-null      object
dtypes: int64(1), object(2)
memory usage: 156.0+ bytes


## Data Manipulation and Analysis with Pandas 

Data manipulation and analysis are key tasks in any data science or data analysis project. Pandas provides a wide range of functions for both, making it easier to clean, transform, and extract insights from data. In this lesson, we will cover various data manipulation and analysis techniques using Pandas.

In [40]:

data=pd.read_csv('train.csv')
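
`train.csv` is not included here, so the following sketch reads the same kind of data from an in-memory string instead. The text `csv_text` is a hypothetical stand-in that copies only four columns from the first three rows shown in this notebook, and `sample` is an illustrative name.

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for the first rows of train.csv (only a few columns shown)
csv_text = """ID,crim,zn,medv
1,0.00632,18.0,24.0
2,0.02731,0.0,21.6
4,0.03237,0.0,33.4
"""

# read_csv accepts a file path or any file-like object such as StringIO
sample = pd.read_csv(StringIO(csv_text))
print(sample.shape)
print(sample.dtypes)
```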



In [41]:
data.head()

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9


In [42]:
data.tail()

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
328,500,0.17783,0.0,9.69,0,0.585,5.569,73.5,2.3999,6,391,19.2,395.77,15.1,17.5
329,502,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
330,503,0.04527,0.0,11.93,0,0.573,6.12,76.7,2.2875,1,273,21.0,396.9,9.08,20.6
331,504,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9
332,506,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273,21.0,396.9,7.88,11.9


In [43]:
data.describe()

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0
mean,250.951952,3.360341,10.689189,11.293483,0.06006,0.557144,6.265619,68.226426,3.709934,9.633634,409.279279,18.448048,359.466096,12.515435,22.768769
std,147.859438,7.352272,22.674762,6.998123,0.237956,0.114955,0.703952,28.133344,1.981123,8.742174,170.841988,2.151821,86.584567,7.067781,9.173468
min,1.0,0.00632,0.0,0.74,0.0,0.385,3.561,6.0,1.1296,1.0,188.0,12.6,3.5,1.73,5.0
25%,123.0,0.07896,0.0,5.13,0.0,0.453,5.884,45.4,2.1224,4.0,279.0,17.4,376.73,7.18,17.4
50%,244.0,0.26169,0.0,9.9,0.0,0.538,6.202,76.7,3.0923,5.0,330.0,19.0,392.05,10.97,21.6
75%,377.0,3.67822,12.5,18.1,0.0,0.631,6.595,93.8,5.1167,24.0,666.0,20.2,396.24,16.42,25.0
max,506.0,73.5341,100.0,27.74,1.0,0.871,8.725,100.0,10.7103,24.0,711.0,21.2,396.9,37.97,50.0


In [44]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ID       333 non-null    int64  
 1   crim     333 non-null    float64
 2   zn       333 non-null    float64
 3   indus    333 non-null    float64
 4   chas     333 non-null    int64  
 5   nox      333 non-null    float64
 6   rm       333 non-null    float64
 7   age      333 non-null    float64
 8   dis      333 non-null    float64
 9   rad      333 non-null    int64  
 10  tax      333 non-null    int64  
 11  ptratio  333 non-null    float64
 12  black    333 non-null    float64
 13  lstat    333 non-null    float64
 14  medv     333 non-null    float64
dtypes: float64(11), int64(4)
memory usage: 39.2 KB


In [45]:
data.dtypes

ID           int64
crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
black      float64
lstat      float64
medv       float64
dtype: object

In [46]:
## Handling Missing Values
data.isnull().any(axis=1)

0      False
1      False
2      False
3      False
4      False
       ...  
328    False
329    False
330    False
331    False
332    False
Length: 333, dtype: bool

In [47]:
data.isnull().any()

ID         False
crim       False
zn         False
indus      False
chas       False
nox        False
rm         False
age        False
dis        False
rad        False
tax        False
ptratio    False
black      False
lstat      False
medv       False
dtype: bool

In [48]:
data.isnull().sum()

ID         0
crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
black      0
lstat      0
medv       0
dtype: int64

In [49]:
data.fillna(0)

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.60,12.43,22.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
328,500,0.17783,0.0,9.69,0,0.585,5.569,73.5,2.3999,6,391,19.2,395.77,15.10,17.5
329,502,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
330,503,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
331,504,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9


In [50]:
## Filling missing values with the mean of the column

data['crim']=data['crim'].fillna(data['crim'].mean())
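
The dataset loaded above has no missing values, so the calls in this section leave it unchanged. To see what they do, here is a minimal sketch on a hypothetical toy frame (`toy`) that contains two missing entries; all names and values in it are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
toy = pd.DataFrame({'crim': [0.1, np.nan, 0.3], 'zn': [18.0, 0.0, np.nan]})

missing_per_column = toy.isnull().sum()              # count of NaN per column
rows_with_missing = toy[toy.isnull().any(axis=1)]    # rows with at least one NaN

filled = toy.copy()
filled['crim'] = filled['crim'].fillna(filled['crim'].mean())  # mean of 0.1 and 0.3
dropped = toy.dropna()                                         # keep only complete rows

print(missing_per_column)
print(filled)
print(dropped)
```

Filling with the column mean keeps every row, while `dropna` discards incomplete rows; which one is appropriate depends on how much data is missing and why.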

In [51]:
data.head()

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9


In [52]:
df.dtypes

name    object
age      int64
city    object
dtype: object

In [53]:
## Renaming Columns

data=data.rename(columns={'crim':'crime'})
data

Unnamed: 0,ID,crime,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.60,12.43,22.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
328,500,0.17783,0.0,9.69,0,0.585,5.569,73.5,2.3999,6,391,19.2,395.77,15.10,17.5
329,502,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
330,503,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
331,504,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9


In [54]:
## Change data types

data['age']=data['age'].astype(int)
data.head()

Unnamed: 0,ID,crime,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65,4.09,1,296,15.3,396.9,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78,4.9671,2,242,17.8,396.9,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.458,6.998,45,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.458,7.147,54,6.0622,3,222,18.7,396.9,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.524,6.012,66,5.5605,5,311,15.2,395.6,12.43,22.9


In [55]:
data.dtypes

ID           int64
crime      float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age          int64
dis        float64
rad          int64
tax          int64
ptratio    float64
black      float64
lstat      float64
medv       float64
dtype: object
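
Note that `astype(int)` truncates toward zero rather than rounding, which is why 78.9 became 78 in the output above. The sketch below compares truncation with rounding on the first three `age` values from the dataset; `truncated` and `rounded` are illustrative names.

```python
import pandas as pd

age = pd.Series([65.2, 78.9, 45.8])   # first three 'age' values shown earlier

truncated = age.astype(int)           # drops the fractional part: 65, 78, 45
rounded = age.round().astype(int)     # rounds first: 65, 79, 46
print(truncated.tolist(), rounded.tolist())
```

Converting to `int` also fails if the column contains missing values, so any NaN entries need to be filled or dropped first.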

In [56]:
data['new_age']=data['age'].apply(lambda x:x*2)

In [57]:
data.head()

Unnamed: 0,ID,crime,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv,new_age
0,1,0.00632,18.0,2.31,0,0.538,6.575,65,4.09,1,296,15.3,396.9,4.98,24.0,130
1,2,0.02731,0.0,7.07,0,0.469,6.421,78,4.9671,2,242,17.8,396.9,9.14,21.6,156
2,4,0.03237,0.0,2.18,0,0.458,6.998,45,6.0622,3,222,18.7,394.63,2.94,33.4,90
3,5,0.06905,0.0,2.18,0,0.458,7.147,54,6.0622,3,222,18.7,396.9,5.33,36.2,108
4,7,0.08829,12.5,7.87,0,0.524,6.012,66,5.5605,5,311,15.2,395.6,12.43,22.9,132
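
For simple arithmetic like this, `apply` with a lambda gives the same result as a plain column operation, which is usually faster because it works on the whole column at once. A minimal sketch on three of the `age` values above; the names `via_apply` and `via_vectorized` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'age': [65, 78, 45]})

via_apply = data['age'].apply(lambda x: x * 2)   # calls the lambda once per element
via_vectorized = data['age'] * 2                 # one operation over the whole column

print(via_apply.equals(via_vectorized))
```

`apply` remains useful when the per-element logic cannot be expressed with built-in column operations.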


In [86]:
# ## Data Aggregation and Grouping
# grouped_mean=data.groupby('chas')['rm'].mean()
# print(grouped_mean)
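
The commented-out cell above was not run, so no output is shown for it. As a sketch of what grouping does, the example below applies the same `groupby` call to a small hypothetical frame (`toy`) with made-up values, then computes several aggregates at once with named aggregation.

```python
import pandas as pd

# Hypothetical values; only the column names match the dataset
toy = pd.DataFrame({
    'chas': [0, 0, 1, 1],
    'rm': [6.0, 7.0, 5.0, 8.0],
    'medv': [20.0, 30.0, 25.0, 35.0],
})

# Mean of 'rm' within each 'chas' group
grouped_mean = toy.groupby('chas')['rm'].mean()

# Several aggregates at once, with explicit output column names
summary = toy.groupby('chas').agg(rm_mean=('rm', 'mean'), medv_max=('medv', 'max'))

print(grouped_mean)
print(summary)
```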

In [91]:
## Merging and joining DataFrames
# Create sample DataFrames

df1=pd.DataFrame({'key':['A','B','C'],'Value1':[1,2,3]})
df2=pd.DataFrame({'key':['A','B','D'],'Value2':[4,5,6]})


In [92]:
df1

  key  Value1
0   A       1
1   B       2
2   C       3


In [93]:
df2

  key  Value2
0   A       4
1   B       5
2   D       6


In [94]:
## Merge the DataFrames on the 'key' column
pd.merge(df1,df2,on="key",how='inner')

  key  Value1  Value2
0   A       1       4
1   B       2       5


In [95]:
pd.merge(df1,df2,on="key",how='outer')

  key  Value1  Value2
0   A     1.0     4.0
1   B     2.0     5.0
2   C     3.0     NaN
3   D     NaN     6.0


In [96]:
pd.merge(df1,df2,on="key",how='left')

  key  Value1  Value2
0   A       1     4.0
1   B       2     5.0
2   C       3     NaN


In [97]:
pd.merge(df1,df2,on="key",how='right')

  key  Value1  Value2
0   A     1.0       4
1   B     2.0       5
2   D     NaN       6
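
In the four merges above, unmatched keys show up as missing values, and integer columns become floats when a missing value appears in them. To see which side each row came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch with the same two frames; `only_left` and `only_right` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})

# Outer merge keeps every key; '_merge' records 'both', 'left_only', or 'right_only'
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(merged)

only_left = merged.loc[merged['_merge'] == 'left_only', 'key'].tolist()
only_right = merged.loc[merged['_merge'] == 'right_only', 'key'].tolist()
print(only_left, only_right)
```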


## Reading Data From Different Sources 

In [98]:
data='{"employee_name":"Madhav","email":"ymadhav@gamil.com","job_profile": [{"title1":"Team Lead","title2":"Sr. Developer"}]}'

In [102]:
from io import StringIO

In [103]:
data

'{"employee_name":"Madhav","email":"ymadhav@gamil.com","job_profile": [{"title1":"Team Lead","title2":"Sr. Developer"}]}'

In [104]:
import pandas as pd
import json
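
The notebook imports `json` but does not show the JSON string being loaded. One possible way to turn such a record into a DataFrame is sketched below, assuming a well-formed JSON string; `json_text` is a hypothetical, shortened version of the record, and `record` and `flat` are illustrative names.

```python
import json
import pandas as pd

# Hypothetical, well-formed record with a nested list under 'job_profile'
json_text = '{"employee_name": "Madhav", "job_profile": [{"title1": "Team Lead", "title2": "Sr. Developer"}]}'

record = json.loads(json_text)   # parse the string into a Python dict

# Flatten the nested list into rows and carry 'employee_name' along as a column
flat = pd.json_normalize(record, record_path='job_profile', meta=['employee_name'])
print(flat)
```

`json.loads` raises a `JSONDecodeError` if the string is not valid JSON, so parsing first is also a quick way to check the input.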


In [109]:
!pip install lxml

Collecting lxml
  Downloading lxml-6.0.2-cp313-cp313-win_amd64.whl.metadata (3.7 kB)
Downloading lxml-6.0.2-cp313-cp313-win_amd64.whl (4.0 MB)

In [118]:
df=pd.read_html('example.html')

In [120]:
df[0]

      Name  Age  Grade
0    Alice   10      5
1      Bob   11      6
2  Charlie    9      4


In [126]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2
