### What is Pandas?

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

### Pandas Series

A Pandas Series is like a column in a table. It is a 1-D array holding data of any type.

### Why use Pandas?
* Easy data cleaning and transformation.
* Efficient handling of large datasets.
* Integration with other Python libraries like Matplotlib and Seaborn for visualization.

### Key/Primary Data Structures in Pandas
* **Series** - A one-dimensional labeled array that can hold any data type.
* **DataFrame** - A two-dimensional labeled data structure made up of columns, each of which can hold a different data type.
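
A minimal sketch of both structures is shown here. The variable names `marks` and `students` and their values are made up for illustration only.

```python
import pandas as pd

# A Series: one-dimensional, with a label (index) for every value
marks = pd.Series([42, 45, 48], index=["John", "Trupti", "Bob"])

# A DataFrame: two-dimensional, each column can have its own data type
students = pd.DataFrame({
    "Name": ["John", "Trupti", "Bob"],   # text column
    "Marks": [42, 45, 48],               # integer column
})

print(marks["Trupti"])           # look up a value by its label
print(students.shape)            # (rows, columns)
print(type(students["Marks"]))   # each column of a DataFrame is a Series
```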

Pandas supports data cleaning, filtering, aggregation, merging, reshaping and more. It also integrates well with other popular Python libraries such as NumPy and Matplotlib.

### What is CSV?
* A CSV (Comma-Separated Values) file is a plain text file that stores tabular data (data organised in rows and columns) in a structured format. It is a commonly used file format for storing and exchanging data between different software applications.
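
To see what "plain text with rows and columns" means, the sketch below parses a small CSV string held in memory. The content of `csv_text` is hypothetical, and `io.StringIO` is used only so that no file is needed.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line is the header, each later line is one row
csv_text = "Name,Age,City\nRam,15,Ahmedabad\nShyam,18,Nagpur\n"

# read_csv accepts a file path or a file-like object such as StringIO
people = pd.read_csv(io.StringIO(csv_text))
print(people)
print(people.dtypes)  # Age is parsed as an integer column, the others as text
```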

### History of Pandas:-
* It was created by Wes McKinney and released in 2008.


### Installation Method :
1. Pandas environment setup:
  * `pip install pandas`
2. How to use it:
  * `import pandas as pd`

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Series Operations

In [135]:
#Creating a Series of 50 random floats
ser = pd.Series(np.random.rand(50))
print(ser)

0     0.300148
1     0.351465
2     0.211845
3     0.141259
4     0.013483
5     0.094913
6     0.366651
7     0.649618
8     0.103328
9     0.854823
10    0.823733
11    0.043041
12    0.647378
13    0.971052
14    0.928046
15    0.276852
16    0.830395
17    0.711892
18    0.498712
19    0.481921
20    0.942366
21    0.921874
22    0.336953
23    0.408575
24    0.529827
25    0.751711
26    0.334131
27    0.007396
28    0.997051
29    0.070323
30    0.306926
31    0.062836
32    0.545454
33    0.622744
34    0.863793
35    0.384216
36    0.711159
37    0.390751
38    0.434700
39    0.541793
40    0.649269
41    0.733603
42    0.132119
43    0.713394
44    0.891026
45    0.118435
46    0.230173
47    0.348603
48    0.264121
49    0.438723
dtype: float64


In [136]:
#Checking the type of the object
type(ser)

pandas.core.series.Series

In [137]:
#Getting the index of the Series
ser.index

RangeIndex(start=0, stop=50, step=1)

In [138]:
#Getting the data type of the values
ser.dtype

dtype('float64')

In [139]:
#Viewing the first 25 elements (head() shows 5 by default)
ser.head(25)

0     0.300148
1     0.351465
2     0.211845
3     0.141259
4     0.013483
5     0.094913
6     0.366651
7     0.649618
8     0.103328
9     0.854823
10    0.823733
11    0.043041
12    0.647378
13    0.971052
14    0.928046
15    0.276852
16    0.830395
17    0.711892
18    0.498712
19    0.481921
20    0.942366
21    0.921874
22    0.336953
23    0.408575
24    0.529827
dtype: float64

In [140]:
#Viewing the last 5 elements
ser.tail(5)

45    0.118435
46    0.230173
47    0.348603
48    0.264121
49    0.438723
dtype: float64

In [141]:
#Converting the Series into a NumPy array
ser.to_numpy()

array([0.30014774, 0.35146467, 0.21184512, 0.14125886, 0.01348345,
       0.09491341, 0.36665125, 0.6496184 , 0.10332809, 0.85482341,
       0.82373254, 0.04304061, 0.64737771, 0.97105155, 0.92804618,
       0.27685158, 0.8303946 , 0.71189151, 0.49871249, 0.48192084,
       0.94236642, 0.92187436, 0.33695289, 0.40857526, 0.52982721,
       0.7517111 , 0.3341306 , 0.00739607, 0.99705148, 0.07032303,
       0.3069255 , 0.06283638, 0.5454541 , 0.62274369, 0.86379274,
       0.38421614, 0.71115897, 0.3907507 , 0.43470038, 0.54179258,
       0.64926895, 0.73360314, 0.13211862, 0.71339428, 0.89102597,
       0.11843503, 0.23017308, 0.34860338, 0.26412058, 0.43872322])

### DataFrame Operations

In [142]:
#Creating a DataFrame with 500 rows and 8 columns of random floats
df = pd.DataFrame(np.random.rand(500,8),index = np.arange(500))
print(df)

            0         1         2         3         4         5         6  \
0    0.043573  0.242913  0.776838  0.775832  0.401016  0.069541  0.993391   
1    0.037827  0.324498  0.553638  0.493932  0.028505  0.413882  0.612692   
2    0.212233  0.862371  0.257922  0.166698  0.264961  0.209552  0.826419   
3    0.200016  0.292486  0.114299  0.155402  0.736417  0.749729  0.304528   
4    0.595790  0.041186  0.756767  0.294144  0.393813  0.062137  0.497824   
..        ...       ...       ...       ...       ...       ...       ...   
495  0.824870  0.838064  0.850449  0.079758  0.958482  0.567355  0.359682   
496  0.905538  0.035686  0.260313  0.027656  0.056817  0.040003  0.029204   
497  0.308490  0.774865  0.971633  0.726746  0.185933  0.460538  0.695751   
498  0.540125  0.675441  0.001357  0.871751  0.792678  0.882902  0.815541   
499  0.262783  0.479571  0.876111  0.508239  0.631704  0.101912  0.515363   

            7  
0    0.139137  
1    0.033953  
2    0.801596  
3    0.3779

In [143]:
#Checking the type of the object
type(df)

pandas.core.frame.DataFrame

In [144]:
#Getting the column labels
df.columns

RangeIndex(start=0, stop=8, step=1)

In [145]:
#Converting the DataFrame into a NumPy array
df.to_numpy()


array([[0.04357313, 0.24291252, 0.77683846, ..., 0.06954141, 0.99339135,
        0.13913746],
       [0.03782697, 0.32449805, 0.55363783, ..., 0.41388154, 0.61269159,
        0.03395311],
       [0.21223315, 0.86237146, 0.25792185, ..., 0.20955234, 0.82641889,
        0.80159646],
       ...,
       [0.30848988, 0.7748655 , 0.97163291, ..., 0.46053812, 0.69575065,
        0.10595071],
       [0.54012515, 0.67544066, 0.00135702, ..., 0.88290189, 0.81554055,
        0.83332413],
       [0.26278284, 0.47957087, 0.87611136, ..., 0.10191208, 0.51536303,
        0.44559745]], shape=(500, 8))

### CSV (Comma Separated Values) Operations

In [146]:
#Writing the DataFrame to a CSV file
df.to_csv("code.csv")

In [147]:
#Starting records
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.043573,0.242913,0.776838,0.775832,0.401016,0.069541,0.993391,0.139137
1,0.037827,0.324498,0.553638,0.493932,0.028505,0.413882,0.612692,0.033953
2,0.212233,0.862371,0.257922,0.166698,0.264961,0.209552,0.826419,0.801596
3,0.200016,0.292486,0.114299,0.155402,0.736417,0.749729,0.304528,0.377985
4,0.59579,0.041186,0.756767,0.294144,0.393813,0.062137,0.497824,0.074738


In [148]:
#Ending records
df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7
495,0.82487,0.838064,0.850449,0.079758,0.958482,0.567355,0.359682,0.257891
496,0.905538,0.035686,0.260313,0.027656,0.056817,0.040003,0.029204,0.160753
497,0.30849,0.774865,0.971633,0.726746,0.185933,0.460538,0.695751,0.105951
498,0.540125,0.675441,0.001357,0.871751,0.792678,0.882902,0.815541,0.833324
499,0.262783,0.479571,0.876111,0.508239,0.631704,0.101912,0.515363,0.445597


In [149]:
#importing/Reading CSV File.
newdf = pd.read_csv("code.csv")
print(newdf)

     Unnamed: 0         0         1         2         3         4         5  \
0             0  0.043573  0.242913  0.776838  0.775832  0.401016  0.069541   
1             1  0.037827  0.324498  0.553638  0.493932  0.028505  0.413882   
2             2  0.212233  0.862371  0.257922  0.166698  0.264961  0.209552   
3             3  0.200016  0.292486  0.114299  0.155402  0.736417  0.749729   
4             4  0.595790  0.041186  0.756767  0.294144  0.393813  0.062137   
..          ...       ...       ...       ...       ...       ...       ...   
495         495  0.824870  0.838064  0.850449  0.079758  0.958482  0.567355   
496         496  0.905538  0.035686  0.260313  0.027656  0.056817  0.040003   
497         497  0.308490  0.774865  0.971633  0.726746  0.185933  0.460538   
498         498  0.540125  0.675441  0.001357  0.871751  0.792678  0.882902   
499         499  0.262783  0.479571  0.876111  0.508239  0.631704  0.101912   

            6         7  
0    0.993391  0.139137  
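
The `Unnamed: 0` column in the output above is the old index, which `to_csv` writes as an extra first column by default. The sketch below, using a small made-up frame called `small` and in-memory text instead of files, shows where the column comes from and two ways to avoid it.

```python
import io
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# With no path, to_csv returns the CSV text instead of writing a file
with_index = small.to_csv()                # index written as an unnamed first column
without_index = small.to_csv(index=False)  # index left out

# Reading the first text back produces the "Unnamed: 0" column
cols_default = pd.read_csv(io.StringIO(with_index)).columns.tolist()
# Either skip the index when writing, or tell read_csv which column is the index
cols_no_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()
cols_index_col = pd.read_csv(io.StringIO(with_index), index_col=0).columns.tolist()

print(cols_default)
print(cols_no_index)
print(cols_index_col)
```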

In [164]:
#Writing the CSV without the index column
df.to_csv("index_false.csv",index=False)

PermissionError: [Errno 13] Permission denied: 'index_false.csv'

In [None]:
newdf2 = pd.read_csv("index_false.csv")
print(newdf2)

            0         1         2         3         4         5         6  \
0    0.491064  0.137624  0.767092  0.808754  0.081708  0.815857  0.595985   
1    0.616215  0.190544  0.273897  0.052844  0.998048  0.631877  0.419818   
2    0.282608  0.809215  0.097414  0.166414  0.105041  0.435520  0.456667   
3    0.175790  0.498210  0.427152  0.470415  0.394824  0.953192  0.182492   
4    0.986110  0.600607  0.618594  0.036768  0.200630  0.593108  0.199204   
..        ...       ...       ...       ...       ...       ...       ...   
495  0.422657  0.269586  0.989281  0.791910  0.700522  0.523557  0.894300   
496  0.098266  0.772520  0.967135  0.705640  0.159286  0.635730  0.293501   
497  0.465451  0.767535  0.779935  0.838888  0.401245  0.383043  0.681099   
498  0.534334  0.635509  0.078362  0.620522  0.211123  0.480846  0.720277   
499  0.363593  0.071777  0.080730  0.879670  0.011276  0.153994  0.707142   

            7  
0    0.802316  
1    0.171715  
2    0.087439  
3    0.6396

In [3]:
runs = pd.read_csv("batter.csv")
runs

Unnamed: 0,batter,runs,avg,strike_rate
0,V Kohli,6634,36.251366,125.977972
1,S Dhawan,6244,34.882682,122.840842
2,DA Warner,5883,41.429577,136.401577
3,RG Sharma,5881,30.314433,126.964594
4,SK Raina,5536,32.374269,132.535312
...,...,...,...,...
600,C Nanda,0,0.000000,0.000000
601,Akash Deep,0,0.000000,0.000000
602,S Ladda,0,0.000000,0.000000
603,V Pratap Singh,0,0.000000,0.000000


In [None]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.491064,0.137624,0.767092,0.808754,0.081708,0.815857,0.595985,0.802316
1,0.616215,0.190544,0.273897,0.052844,0.998048,0.631877,0.419818,0.171715
2,0.282608,0.809215,0.097414,0.166414,0.105041,0.43552,0.456667,0.087439
3,0.17579,0.49821,0.427152,0.470415,0.394824,0.953192,0.182492,0.639659
4,0.98611,0.600607,0.618594,0.036768,0.20063,0.593108,0.199204,0.575865


In [None]:
#Computing summary statistics for each numeric column
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,0.502499,0.515558,0.488835,0.499112,0.49022,0.470393,0.510734,0.517981
std,0.290694,0.292958,0.296429,0.285324,0.282215,0.289341,0.280953,0.293076
min,0.004249,0.004314,0.000132,0.002193,0.000174,0.002077,0.00155,0.001817
25%,0.236571,0.264552,0.237575,0.254817,0.240766,0.215562,0.281445,0.263129
50%,0.508055,0.533815,0.47065,0.489774,0.481457,0.443269,0.499144,0.529325
75%,0.753088,0.771075,0.741076,0.729327,0.745384,0.720524,0.740812,0.77139
max,0.999585,0.99868,0.997428,0.999774,0.993263,0.999279,0.99803,0.998778


In [None]:
#Getting the value at row label 0, column label 2
df.loc[0,2]

np.float64(0.7670923417238656)

In [174]:
#Renaming the columns (the list length must match the number of columns)
df.columns= list("abcdefgh")
print(df)

ValueError: Length mismatch: Expected axis has 6 elements, new values have 8 elements

In [None]:
#loc selects by row and column labels
df.loc[0,"c"]
df.loc[497,"f"]

np.float64(0.3830429439900729)

In [161]:
#Dropping a column (returns a new DataFrame; df itself is unchanged)
df.drop(2,axis=1)      #axis 0 = rows   axis 1 = cols

Unnamed: 0,0,1,3,4,5,6,7
0,0.043573,0.242913,0.775832,0.401016,0.069541,0.993391,0.139137
1,0.037827,0.324498,0.493932,0.028505,0.413882,0.612692,0.033953
2,0.212233,0.862371,0.166698,0.264961,0.209552,0.826419,0.801596
3,0.200016,0.292486,0.155402,0.736417,0.749729,0.304528,0.377985
4,0.595790,0.041186,0.294144,0.393813,0.062137,0.497824,0.074738
...,...,...,...,...,...,...,...
495,0.824870,0.838064,0.079758,0.958482,0.567355,0.359682,0.257891
496,0.905538,0.035686,0.027656,0.056817,0.040003,0.029204,0.160753
497,0.308490,0.774865,0.726746,0.185933,0.460538,0.695751,0.105951
498,0.540125,0.675441,0.871751,0.792678,0.882902,0.815541,0.833324


In [None]:
#Dropping a row by its label
df.drop(0,axis=0)

Unnamed: 0,0,1,2,3,4,5,6,7
1,0.466856,0.359871,0.747937,0.069733,0.796018,0.364645,0.492374,0.585879
2,0.058139,0.360487,0.286230,0.961849,0.305600,0.264418,0.020083,0.582199
3,0.999342,0.505364,0.847086,0.271741,0.349743,0.421965,0.937672,0.846392
4,0.762178,0.461737,0.084674,0.404735,0.940679,0.981773,0.417738,0.950390
5,0.351997,0.810319,0.706487,0.363941,0.582671,0.663544,0.896073,0.512005
...,...,...,...,...,...,...,...,...
495,0.062884,0.513508,0.365027,0.694540,0.510541,0.161570,0.746785,0.780570
496,0.711427,0.410717,0.682348,0.187856,0.462389,0.640372,0.869880,0.261745
497,0.688082,0.867054,0.527140,0.438868,0.812595,0.563938,0.829167,0.385654
498,0.106520,0.722323,0.769736,0.719717,0.880508,0.779223,0.146017,0.786235


In [None]:
#iloc selects by integer position
df.iloc[5,4]

np.float64(0.2504780011753588)

In [None]:
df.iloc[0,1]

np.float64(0.13762442302411226)
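
With the default index 0, 1, 2, ... labels and positions happen to be the same numbers, which hides the difference between `loc` and `iloc`. The sketch below uses a made-up frame `scores` whose row labels differ from the positions.

```python
import pandas as pd

scores = pd.DataFrame(
    {"maths": [70, 80, 90], "science": [65, 75, 85]},
    index=[10, 20, 30],   # row labels that differ from the positions 0, 1, 2
)

by_label = scores.loc[20, "science"]   # row labelled 20, column named "science"
by_position = scores.iloc[1, 1]        # second row, second column
print(by_label, by_position)           # both refer to the same cell
```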

In [177]:
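#Changing a single value: row label 2 of column "a" (this chained form triggers the warnings below; df.loc[2, "a"] = "Learn Coding" is the recommended form)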
df["a"][2]="Learn Coding"
df


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["a"][2]="Learn Coding"
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["a"][2]="Learn Coding"


Unnamed: 0,a,b,e,f,g,h
0,0.043573,0.242913,0.401016,0.069541,0.993391,0.139137
1,0.037827,0.324498,0.028505,0.413882,0.612692,0.033953
2,Learn Coding,0.862371,0.264961,0.209552,0.826419,0.801596
3,0.200016,0.292486,0.736417,0.749729,0.304528,0.377985
4,0.59579,0.041186,0.393813,0.062137,0.497824,0.074738
...,...,...,...,...,...,...
495,0.82487,0.838064,0.958482,0.567355,0.359682,0.257891
496,0.905538,0.035686,0.056817,0.040003,0.029204,0.160753
497,0.30849,0.774865,0.185933,0.460538,0.695751,0.105951
498,0.540125,0.675441,0.792678,0.882902,0.815541,0.833324
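
The warning above recommends doing the assignment in one indexing step with `loc`. A minimal sketch on a made-up frame called `frame` follows; it assigns a number rather than the string used above so that the column keeps a single dtype.

```python
import pandas as pd

frame = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [1.0, 2.0, 3.0]})

# One indexing step: row label 2, column "a"
frame.loc[2, "a"] = 0.9
print(frame)
```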


In [184]:
#Dropping columns "a" and "b" in place (inplace=True changes df itself)
df.drop(["a","b"],axis=1,inplace=True)
df

Unnamed: 0,g,h
0,0.993391,0.139137
1,0.612692,0.033953
2,0.826419,0.801596
3,0.304528,0.377985
4,0.497824,0.074738
...,...,...
495,0.359682,0.257891
496,0.029204,0.160753
497,0.695751,0.105951
498,0.815541,0.833324


In [180]:
#Checking column "g" for missing values
df["g"].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
495    False
496    False
497    False
498    False
499    False
Name: g, Length: 500, dtype: bool

In [185]:
#Dropping rows that contain missing values (returns a new DataFrame)
df.dropna()

Unnamed: 0,g,h
0,0.993391,0.139137
1,0.612692,0.033953
2,0.826419,0.801596
3,0.304528,0.377985
4,0.497824,0.074738
...,...,...
495,0.359682,0.257891
496,0.029204,0.160753
497,0.695751,0.105951
498,0.815541,0.833324
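
The random data above has no missing values, so `isnull()` is `False` everywhere and `dropna()` removes nothing. The sketch below uses a small made-up frame `readings` that does contain `NaN` values to show the effect.

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({"g": [0.5, np.nan, 0.7], "h": [0.1, 0.2, np.nan]})

print(readings["g"].isnull())      # True only where a value is missing
print(readings.isnull().sum())     # number of missing values per column

cleaned = readings.dropna()        # drops every row containing at least one NaN
print(cleaned)
print(readings.shape)              # the original is unchanged: dropna returns a new object
```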


In [196]:
#Avoid naming the variable dict: that would shadow Python's built-in dict type
school_data={
    "Name":["John","Trupti","Bob","Rahul","Rohit"],
    "Marks":[42,45,48,32,44],
    "Sports":["Cricket","Kabbadi","Football","Basketball","Hockey"]
}


In [197]:
school  = pd.DataFrame(school_data)
school

Unnamed: 0,Name,Marks,Sports
0,John,42,Cricket
1,Trupti,45,Kabbadi
2,Bob,48,Football
3,Rahul,32,Basketball
4,Rohit,44,Hockey


In [208]:
school=pd.read_csv("School.csv")

In [199]:
school.to_csv("school_index_false.csv",index=False)

In [6]:
#read data from CSV
ir = pd.read_csv("iris.csv")
ir

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [31]:
#Read data from an Excel file
import pandas as pd

sagar = pd.read_excel(r"C:\Users\Arnob\OneDrive\Desktop\Sagar mall.xlsx")
#Show the first 8 rows
sagar.head(8)


Unnamed: 0,Sagar Super mall,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,,,,,,,,,,,,,,
1,Items,qty,price,price,,,,,,,,,,
2,rice,1kg,50,40,,,,,,,,,,
3,pulses,250g,20,75,,,,,,,,,,
4,egg,12pcs,50,70,,,,,,,,,,
5,sugar,1kg,90,40,,,,,,,,,,
6,oil,5ltr,300,530,,,,,,,,,,
7,total,,510,151,,,,,,,,,,


In [38]:
#Creating a Data Frame

data ={
    "Name":["Ram","Shyam","Gyan"],
    "Age" :[15,18,21],
    "City":["Ahmedabad","Nagpur","Surat"]
}

office = pd.DataFrame(data)
office


Unnamed: 0,Name,Age,City
0,Ram,15,Ahmedabad
1,Shyam,18,Nagpur
2,Gyan,21,Surat


### Converting DataFrame into CSV, Excel and JSON

In [None]:
#Converting DataFrame into CSV
office.to_csv("office.csv",index=False)

In [42]:
#Converting DataFrame into Excel
office.to_excel("office.xlsx",index=False)

In [43]:
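#Converting DataFrame into JSON (index=False is not accepted with the default orient; orient="records" writes one object per row without the index)
office.to_json("office.json",orient="records")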
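
`to_json` can lay the same table out in several ways, selected with the `orient` parameter. The sketch below compares the default layout with `orient="records"` on a small made-up frame `office_demo`, returning the JSON as text instead of writing a file.

```python
import json
import pandas as pd

office_demo = pd.DataFrame({"Name": ["Ram", "Shyam"], "Age": [15, 18]})

# With no path, to_json returns the JSON text instead of writing a file
as_columns = office_demo.to_json()                   # default layout for a DataFrame
as_records = office_demo.to_json(orient="records")   # one object per row, no index

print(as_columns)
print(as_records)
print(json.loads(as_records))
```

The default layout keeps the index labels as keys inside each column, while the records layout leaves them out.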

### Why do we need to explore data?
1. Understand the Data set.
2. Identify the problems.
3. Plan next steps.

In [47]:
#Reading a csv file
ipl = pd.read_csv("ipl-matches.csv")
ipl

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
0,1312200,Ahmedabad,2022-05-29,2022,Final,Rajasthan Royals,Gujarat Titans,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,bat,N,Gujarat Titans,Wickets,7.0,,HH Pandya,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",CB Gaffaney,Nitin Menon
1,1312199,Ahmedabad,2022-05-27,2022,Qualifier 2,Royal Challengers Bangalore,Rajasthan Royals,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,field,N,Rajasthan Royals,Wickets,7.0,,JC Buttler,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...",CB Gaffaney,Nitin Menon
2,1312198,Kolkata,2022-05-25,2022,Eliminator,Royal Challengers Bangalore,Lucknow Super Giants,"Eden Gardens, Kolkata",Lucknow Super Giants,field,N,Royal Challengers Bangalore,Runs,14.0,,RM Patidar,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['Q de Kock', 'KL Rahul', 'M Vohra', 'DJ Hooda...",J Madanagopal,MA Gough
3,1312197,Kolkata,2022-05-24,2022,Qualifier 1,Rajasthan Royals,Gujarat Titans,"Eden Gardens, Kolkata",Gujarat Titans,field,N,Gujarat Titans,Wickets,7.0,,DA Miller,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",BNJ Oxenford,VK Sharma
4,1304116,Mumbai,2022-05-22,2022,70,Sunrisers Hyderabad,Punjab Kings,"Wankhede Stadium, Mumbai",Sunrisers Hyderabad,bat,N,Punjab Kings,Wickets,5.0,,Harpreet Brar,"['PK Garg', 'Abhishek Sharma', 'RA Tripathi', ...","['JM Bairstow', 'S Dhawan', 'M Shahrukh Khan',...",AK Chaudhary,NA Patwardhan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
945,335986,Kolkata,2008-04-20,2007/08,4,Kolkata Knight Riders,Deccan Chargers,Eden Gardens,Deccan Chargers,bat,N,Kolkata Knight Riders,Wickets,5.0,,DJ Hussey,"['WP Saha', 'BB McCullum', 'RT Ponting', 'SC G...","['AC Gilchrist', 'Y Venugopal Rao', 'VVS Laxma...",BF Bowden,K Hariharan
946,335985,Mumbai,2008-04-20,2007/08,5,Mumbai Indians,Royal Challengers Bangalore,Wankhede Stadium,Mumbai Indians,bat,N,Royal Challengers Bangalore,Wickets,5.0,,MV Boucher,"['L Ronchi', 'ST Jayasuriya', 'DJ Thornely', '...","['S Chanderpaul', 'R Dravid', 'LRPL Taylor', '...",SJ Davis,DJ Harper
947,335984,Delhi,2008-04-19,2007/08,3,Delhi Daredevils,Rajasthan Royals,Feroz Shah Kotla,Rajasthan Royals,bat,N,Delhi Daredevils,Wickets,9.0,,MF Maharoof,"['G Gambhir', 'V Sehwag', 'S Dhawan', 'MK Tiwa...","['T Kohli', 'YK Pathan', 'SR Watson', 'M Kaif'...",Aleem Dar,GA Pratapkumar
948,335983,Chandigarh,2008-04-19,2007/08,2,Kings XI Punjab,Chennai Super Kings,"Punjab Cricket Association Stadium, Mohali",Chennai Super Kings,bat,N,Chennai Super Kings,Runs,33.0,,MEK Hussey,"['K Goel', 'JR Hopes', 'KC Sangakkara', 'Yuvra...","['PA Patel', 'ML Hayden', 'MEK Hussey', 'MS Dh...",MR Benson,SL Shastri


In [51]:
#head(10) shows the first 10 rows
ipl.head(10)

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
0,1312200,Ahmedabad,2022-05-29,2022,Final,Rajasthan Royals,Gujarat Titans,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,bat,N,Gujarat Titans,Wickets,7.0,,HH Pandya,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",CB Gaffaney,Nitin Menon
1,1312199,Ahmedabad,2022-05-27,2022,Qualifier 2,Royal Challengers Bangalore,Rajasthan Royals,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,field,N,Rajasthan Royals,Wickets,7.0,,JC Buttler,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...",CB Gaffaney,Nitin Menon
2,1312198,Kolkata,2022-05-25,2022,Eliminator,Royal Challengers Bangalore,Lucknow Super Giants,"Eden Gardens, Kolkata",Lucknow Super Giants,field,N,Royal Challengers Bangalore,Runs,14.0,,RM Patidar,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['Q de Kock', 'KL Rahul', 'M Vohra', 'DJ Hooda...",J Madanagopal,MA Gough
3,1312197,Kolkata,2022-05-24,2022,Qualifier 1,Rajasthan Royals,Gujarat Titans,"Eden Gardens, Kolkata",Gujarat Titans,field,N,Gujarat Titans,Wickets,7.0,,DA Miller,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",BNJ Oxenford,VK Sharma
4,1304116,Mumbai,2022-05-22,2022,70,Sunrisers Hyderabad,Punjab Kings,"Wankhede Stadium, Mumbai",Sunrisers Hyderabad,bat,N,Punjab Kings,Wickets,5.0,,Harpreet Brar,"['PK Garg', 'Abhishek Sharma', 'RA Tripathi', ...","['JM Bairstow', 'S Dhawan', 'M Shahrukh Khan',...",AK Chaudhary,NA Patwardhan
5,1304115,Mumbai,2022-05-21,2022,69,Delhi Capitals,Mumbai Indians,"Wankhede Stadium, Mumbai",Mumbai Indians,field,N,Mumbai Indians,Wickets,5.0,,JJ Bumrah,"['PP Shaw', 'DA Warner', 'MR Marsh', 'RR Pant'...","['Ishan Kishan', 'RG Sharma', 'D Brevis', 'Til...",Nitin Menon,Tapan Sharma
6,1304114,Mumbai,2022-05-20,2022,68,Chennai Super Kings,Rajasthan Royals,"Brabourne Stadium, Mumbai",Chennai Super Kings,bat,N,Rajasthan Royals,Wickets,5.0,,R Ashwin,"['RD Gaikwad', 'DP Conway', 'MM Ali', 'N Jagad...","['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...",CB Gaffaney,NA Patwardhan
7,1304113,Mumbai,2022-05-19,2022,67,Gujarat Titans,Royal Challengers Bangalore,"Wankhede Stadium, Mumbai",Gujarat Titans,bat,N,Royal Challengers Bangalore,Wickets,8.0,,V Kohli,"['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...","['V Kohli', 'F du Plessis', 'GJ Maxwell', 'KD ...",KN Ananthapadmanabhan,GR Sadashiv Iyer
8,1304112,Navi Mumbai,2022-05-18,2022,66,Lucknow Super Giants,Kolkata Knight Riders,"Dr DY Patil Sports Academy, Mumbai",Lucknow Super Giants,bat,N,Lucknow Super Giants,Runs,2.0,,Q de Kock,"['Q de Kock', 'KL Rahul', 'E Lewis', 'DJ Hooda...","['VR Iyer', 'A Tomar', 'N Rana', 'SS Iyer', 'S...",R Pandit,YC Barde
9,1304111,Mumbai,2022-05-17,2022,65,Sunrisers Hyderabad,Mumbai Indians,"Wankhede Stadium, Mumbai",Mumbai Indians,field,N,Sunrisers Hyderabad,Runs,3.0,,RA Tripathi,"['Abhishek Sharma', 'PK Garg', 'RA Tripathi', ...","['RG Sharma', 'Ishan Kishan', 'DR Sams', 'Tila...",CB Gaffaney,N Pandit


In [52]:
#tail(10) shows the last 10 rows
ipl.tail(10)

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
940,335991,Chandigarh,2008-04-25,2007/08,10,Kings XI Punjab,Mumbai Indians,"Punjab Cricket Association Stadium, Mohali",Mumbai Indians,field,N,Kings XI Punjab,Runs,66.0,,KC Sangakkara,"['K Goel', 'IK Pathan', 'KC Sangakkara', 'Yuvr...","['L Ronchi', 'ST Jayasuriya', 'RV Uthappa', 'D...",Aleem Dar,AM Saheba
941,335990,Hyderabad,2008-04-24,2007/08,9,Deccan Chargers,Rajasthan Royals,"Rajiv Gandhi International Stadium, Uppal",Rajasthan Royals,field,N,Rajasthan Royals,Wickets,3.0,,YK Pathan,"['AC Gilchrist', 'VVS Laxman', 'Shahid Afridi'...","['Kamran Akmal', 'GC Smith', 'YK Pathan', 'SR ...",Asad Rauf,MR Benson
942,335989,Chennai,2008-04-23,2007/08,8,Chennai Super Kings,Mumbai Indians,"MA Chidambaram Stadium, Chepauk",Mumbai Indians,field,N,Chennai Super Kings,Runs,6.0,,ML Hayden,"['PA Patel', 'ML Hayden', 'MEK Hussey', 'SK Ra...","['L Ronchi', 'ST Jayasuriya', 'RV Uthappa', 'S...",DJ Harper,GA Pratapkumar
943,335988,Hyderabad,2008-04-22,2007/08,7,Deccan Chargers,Delhi Daredevils,"Rajiv Gandhi International Stadium, Uppal",Deccan Chargers,bat,N,Delhi Daredevils,Wickets,9.0,,V Sehwag,"['AC Gilchrist', 'Y Venugopal Rao', 'VVS Laxma...","['G Gambhir', 'V Sehwag', 'S Dhawan', 'Shoaib ...",IL Howell,AM Saheba
944,335987,Jaipur,2008-04-21,2007/08,6,Rajasthan Royals,Kings XI Punjab,Sawai Mansingh Stadium,Kings XI Punjab,bat,N,Rajasthan Royals,Wickets,6.0,,SR Watson,"['M Kaif', 'Kamran Akmal', 'YK Pathan', 'SR Wa...","['K Goel', 'JR Hopes', 'KC Sangakkara', 'DPMD ...",Aleem Dar,RB Tiffin
945,335986,Kolkata,2008-04-20,2007/08,4,Kolkata Knight Riders,Deccan Chargers,Eden Gardens,Deccan Chargers,bat,N,Kolkata Knight Riders,Wickets,5.0,,DJ Hussey,"['WP Saha', 'BB McCullum', 'RT Ponting', 'SC G...","['AC Gilchrist', 'Y Venugopal Rao', 'VVS Laxma...",BF Bowden,K Hariharan
946,335985,Mumbai,2008-04-20,2007/08,5,Mumbai Indians,Royal Challengers Bangalore,Wankhede Stadium,Mumbai Indians,bat,N,Royal Challengers Bangalore,Wickets,5.0,,MV Boucher,"['L Ronchi', 'ST Jayasuriya', 'DJ Thornely', '...","['S Chanderpaul', 'R Dravid', 'LRPL Taylor', '...",SJ Davis,DJ Harper
947,335984,Delhi,2008-04-19,2007/08,3,Delhi Daredevils,Rajasthan Royals,Feroz Shah Kotla,Rajasthan Royals,bat,N,Delhi Daredevils,Wickets,9.0,,MF Maharoof,"['G Gambhir', 'V Sehwag', 'S Dhawan', 'MK Tiwa...","['T Kohli', 'YK Pathan', 'SR Watson', 'M Kaif'...",Aleem Dar,GA Pratapkumar
948,335983,Chandigarh,2008-04-19,2007/08,2,Kings XI Punjab,Chennai Super Kings,"Punjab Cricket Association Stadium, Mohali",Chennai Super Kings,bat,N,Chennai Super Kings,Runs,33.0,,MEK Hussey,"['K Goel', 'JR Hopes', 'KC Sangakkara', 'Yuvra...","['PA Patel', 'ML Hayden', 'MEK Hussey', 'MS Dh...",MR Benson,SL Shastri
949,335982,Bangalore,2008-04-18,2007/08,1,Royal Challengers Bangalore,Kolkata Knight Riders,M Chinnaswamy Stadium,Royal Challengers Bangalore,field,N,Kolkata Knight Riders,Runs,140.0,,BB McCullum,"['R Dravid', 'W Jaffer', 'V Kohli', 'JH Kallis...","['SC Ganguly', 'BB McCullum', 'RT Ponting', 'D...",Asad Rauf,RE Koertzen


### Understanding the Data.
1. How many columns and rows?
2. What type of data?
3. Is any data missing?

### Solution - Summarize your data.

info() is a method of the DataFrame, not a standalone function. It reports:

1. Number of rows and columns.
2. Column name.
3. dtype (int64) (float64) (object).
4. non null counts.
5. memory usage of the DataFrame.


In [55]:
#Displaying the info of DataFrame    info()...
ipl = pd.read_csv("ipl-matches.csv")
ipl.info()                               #Summary of this Data set

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               950 non-null    int64  
 1   City             899 non-null    object 
 2   Date             950 non-null    object 
 3   Season           950 non-null    object 
 4   MatchNumber      950 non-null    object 
 5   Team1            950 non-null    object 
 6   Team2            950 non-null    object 
 7   Venue            950 non-null    object 
 8   TossWinner       950 non-null    object 
 9   TossDecision     950 non-null    object 
 10  SuperOver        946 non-null    object 
 11  WinningTeam      946 non-null    object 
 12  WonBy            950 non-null    object 
 13  Margin           932 non-null    float64
 14  method           19 non-null     object 
 15  Player_of_Match  946 non-null    object 
 16  Team1Players     950 non-null    object 
 17  Team2Players    

In [None]:
#Displaying the info of DataFrame    info()...
office = pd.read_csv("office.csv")
office.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes
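
`info()` prints its report and returns `None`, so the text cannot be stored with a plain assignment. If the report is needed as a string, one option is the `buf` parameter, sketched below on a made-up frame `staff` with one missing value.

```python
import io
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Ram", "Shyam", "Gyan"],
    "Age": [15, None, 21],        # one missing value on purpose
})

# buf= sends the report to a text buffer instead of the screen
buffer = io.StringIO()
staff.info(buf=buffer)
report = buffer.getvalue()
print(report)   # the Age line shows 2 non-null values out of 3 entries
```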


### Describe Method - Summary Statistics

In [68]:
# Describe Method....
data = {
    "Name":["Ram","Shyam","Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,26,37,29,23,21,45,61],
    "Salary":[50000,60000,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,89,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
perform

Unnamed: 0,Name,Age,Salary,Performance Score
0,Ram,24,50000,85
1,Shyam,26,60000,89
2,Gattu,37,70000,90
3,Arnob,29,100000,75
4,Aditi,23,45000,81
5,Adil,21,55000,78
6,Raj,45,85000,95
7,Simran,61,94000,93


In [67]:
perform.to_csv("perform.csv",index=False)

In [66]:
# Describe Method - Statistics.
perform.describe()

Unnamed: 0,Age,Salary,Performance Score
count,8.0,8.0,8.0
mean,33.25,69875.0,85.75
std,13.802174,20876.764254,7.225945
min,21.0,45000.0,75.0
25%,23.75,53750.0,80.25
50%,27.5,65000.0,87.0
75%,39.0,87250.0,90.75
max,61.0,100000.0,95.0
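
`describe()` returns a DataFrame, so individual statistics can be picked out by label. The sketch below reuses the `Age` values from the table above in a separate frame called `ages` (a name chosen here for illustration).

```python
import pandas as pd

ages = pd.DataFrame({"Age": [24, 26, 37, 29, 23, 21, 45, 61]})
summary = ages.describe()

# Look up single statistics by row label and column name
mean_age = summary.loc["mean", "Age"]
median_age = summary.loc["50%", "Age"]   # the 50% row is the median
print(mean_age, median_age)

# The same numbers are available from the individual methods
print(ages["Age"].mean(), ages["Age"].median())
```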


# Before Analyzing or Manipulating a Dataset
1. How big is your Dataset?
2. What are the names of your columns? 

# Solutions

shape and columns - they are attributes, not methods, so they are used without parentheses.

In [None]:
#Shape and Columns (attributes). The {} below are f-string placeholders, not tuples

data = {
    "Name":["Ram","Shyam","Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,26,37,29,23,21,45,61],
    "Salary":[50000,60000,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,89,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
print(perform)

#shape is a tuple: (rows, columns)
print(f"Shape: {perform.shape}")

#columns is an Index holding the column names
print(f"Column Names: {perform.columns}")

     Name  Age  Salary  Performance Score
0     Ram   24   50000                 85
1   Shyam   26   60000                 89
2   Gattu   37   70000                 90
3   Arnob   29  100000                 75
4   Aditi   23   45000                 81
5    Adil   21   55000                 78
6     Raj   45   85000                 95
7  Simran   61   94000                 93
Shape: (8, 4)
Column Names: Index(['Name', 'Age', 'Salary', 'Performance Score'], dtype='object')
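
Because `shape` is a tuple and `columns` is an Index, both can be unpacked or converted to plain Python values. A minimal sketch on a shortened, made-up frame `perform_demo`:

```python
import pandas as pd

perform_demo = pd.DataFrame({
    "Name": ["Ram", "Shyam", "Gattu"],
    "Age": [24, 26, 37],
})

# Attributes are read without parentheses
n_rows, n_cols = perform_demo.shape         # shape is a (rows, columns) tuple
column_names = list(perform_demo.columns)   # list() turns the Index into plain names

print(n_rows, n_cols)
print(column_names)
```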


# Selecting, Filtering, Indexing of Rows and Columns.
1. Select specific column.
2. Filtering rows.
3. Combine multiple conditions.

# Solutions.

1. Square brackets.
2. Boolean conditions.

# Selecting Columns.
1. Selecting a single column returns a Series.
2. Selecting multiple columns returns a DataFrame.

column = df["Column Name"]                   Access a single column...
subset = df[["Column1","Column2","....."]]   Access multiple columns (pass a list of names)...

# Filtering Rows.
 Boolean indexing.


# Based on a single condition.

filtered_Rows = df[df["Salary"] > 50000] 

# Combine multiple conditions.

filtered_Rows = df[(df["Column1"] > 50000) & (df["Column2"] < 80000)] 



Multiple column names are always passed as a list, inside [].



In [4]:
data = {
    "Name":["Ram","Shyam","Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,26,37,29,23,21,45,61],
    "Salary":[50000,60000,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,89,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
print(perform)

     Name  Age  Salary  Performance Score
0     Ram   24   50000                 85
1   Shyam   26   60000                 89
2   Gattu   37   70000                 90
3   Arnob   29  100000                 75
4   Aditi   23   45000                 81
5    Adil   21   55000                 78
6     Raj   45   85000                 95
7  Simran   61   94000                 93


In [None]:
#Selecting Single Column Return series...
name = perform["Name"]
print(name)

0       Ram
1     Shyam
2     Gattu
3     Arnob
4     Aditi
5      Adil
6       Raj
7    Simran
Name: Name, dtype: object


In [88]:
#Selecting Multiple Columns...
subset = perform[["Name","Salary","Age"]]
print(subset)

     Name  Salary  Age
0     Ram   50000   24
1   Shyam   60000   26
2   Gattu   70000   37
3   Arnob  100000   29
4   Aditi   45000   23
5    Adil   55000   21
6     Raj   85000   45
7  Simran   94000   61


In [None]:
#Filtering rows based on a single condition: Salary > 60k
high_salary = perform[perform["Salary"] > 60000]
print(high_salary)

     Name  Age  Salary  Performance Score
2   Gattu   37   70000                 90
3   Arnob   29  100000                 75
6     Raj   45   85000                 95
7  Simran   61   94000                 93


In [None]:
#Filtering rows based on multiple conditions: Age > 20 and Salary >= 70k, joined with AND (&)
filtered = perform[(perform["Age"] > 20) & (perform["Salary"] >= 70000)]
print(filtered)



     Name  Age  Salary  Performance Score
2   Gattu   37   70000                 90
3   Arnob   29  100000                 75
6     Raj   45   85000                 95
7  Simran   61   94000                 93


In [5]:
#Using OR (|) condition
filtered_or = perform[(perform["Age"] < 35) | (perform["Performance Score"] > 90)]
print(filtered_or)

     Name  Age  Salary  Performance Score
0     Ram   24   50000                 85
1   Shyam   26   60000                 89
3   Arnob   29  100000                 75
4   Aditi   23   45000                 81
5    Adil   21   55000                 78
6     Raj   45   85000                 95
7  Simran   61   94000                 93


### Modifications of Datasets.
1. Adding columns via assignment (square brackets).   # df["Column_Name"] = some_data
2. Adding columns using insert().                     # df.insert(loc, "Column_Name", some_data)
3. Updating values using .loc[].                      # df.loc[row_index, "Column_Name"] = new_value
4. Removing columns using .drop().                    # df.drop(columns=["ColumnName"], inplace=True)
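
A compact sketch of these operations applied in sequence to a fresh, hypothetical two-row frame, so the result does not depend on which notebook cells ran earlier. The column names and values here are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical two-row frame for illustration only.
df = pd.DataFrame({"Name": ["A", "B"], "Salary": [50000, 60000]})

df["Bonus"] = df["Salary"] * 0.1         # add by assignment (appended as the last column)
df.insert(0, "Employee ID", [10, 20])    # add at position 0 (becomes the first column)
df.loc[0, "Salary"] = 67000              # update one cell by row label and column name
trimmed = df.drop(columns=["Bonus"])     # without inplace=True, drop returns a new frame

print(df)
print(trimmed)
```

Leaving out `inplace=True` keeps the original frame intact, which makes it easier to re-run a cell without losing columns.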

In [4]:
#Adding columns via assignment..    #square Brackets 

data = {
    "Name":["Ram","Shyam","Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,26,37,29,23,21,45,61],
    "Salary":[50000,60000,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,89,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
print(perform)

#Bonus column equal to 10% of Salary....

perform["Bonus"] = perform["Salary"] * 0.1          # 1st method: assignment
print(perform)



     Name  Age  Salary  Performance Score
0     Ram   24   50000                 85
1   Shyam   26   60000                 89
2   Gattu   37   70000                 90
3   Arnob   29  100000                 75
4   Aditi   23   45000                 81
5    Adil   21   55000                 78
6     Raj   45   85000                 95
7  Simran   61   94000                 93
     Name  Age  Salary  Performance Score    Bonus
0     Ram   24   50000                 85   5000.0
1   Shyam   26   60000                 89   6000.0
2   Gattu   37   70000                 90   7000.0
3   Arnob   29  100000                 75  10000.0
4   Aditi   23   45000                 81   4500.0
5    Adil   21   55000                 78   5500.0
6     Raj   45   85000                 95   8500.0
7  Simran   61   94000                 93   9400.0


In [15]:
#Using the insert() method: insert(loc, column, value) places the new column at position loc.
perform.insert(0,"l", [10,20,30,40,50,60,70,80])
print(perform)

    l  Employee ID    Name  Age  Salary  Performance Score    Bonus
0  10           10     Ram   24   50000                 85   5000.0
1  20           20   Shyam   26   60000                 89   6000.0
2  30           30   Gattu   37   70000                 90   7000.0
3  40           40   Arnob   29  100000                 75  10000.0
4  50           50   Aditi   23   45000                 81   4500.0
5  60           60    Adil   21   55000                 78   5500.0
6  70           70     Raj   45   85000                 95   8500.0
7  80           80  Simran   61   94000                 93   9400.0


In [None]:
# .loc       #Updating Ram's Salary..

perform.loc[0,"Salary"] = 67000         #changes it from 50000 to 67000
print(perform)


    l  Employee ID    Name  Age  Salary  Performance Score    Bonus
0  10           10     Ram   24   67000                 85   5000.0
1  20           20   Shyam   26   60000                 89   6000.0
2  30           30   Gattu   37   70000                 90   7000.0
3  40           40   Arnob   29  100000                 75  10000.0
4  50           50   Aditi   23   45000                 81   4500.0
5  60           60    Adil   21   55000                 78   5500.0
6  70           70     Raj   45   85000                 95   8500.0
7  80           80  Simran   61   94000                 93   9400.0


In [18]:
#Increasing salary by 50% (factor 1.5); a 5% raise would use 1.05
perform["Salary"] = perform["Salary"] * 1.5

print(perform)


    l  Employee ID    Name  Age    Salary  Performance Score    Bonus
0  10           10     Ram   24  100500.0                 85   5000.0
1  20           20   Shyam   26   90000.0                 89   6000.0
2  30           30   Gattu   37  105000.0                 90   7000.0
3  40           40   Arnob   29  150000.0                 75  10000.0
4  50           50   Aditi   23   67500.0                 81   4500.0
5  60           60    Adil   21   82500.0                 78   5500.0
6  70           70     Raj   45  127500.0                 95   8500.0
7  80           80  Simran   61  141000.0                 93   9400.0


In [11]:
# Removing columns   #.drop() 
print("Modified Dataset")

perform.drop(columns=["Performance Score"],inplace=True)
print(perform)


Modified Dataset
     Name  Age  Salary
0     Ram   24   50000
1   Shyam   26   60000
2   Gattu   37   70000
3   Arnob   29  100000
4   Aditi   23   45000
5    Adil   21   55000
6     Raj   45   85000
7  Simran   61   94000


# Handling Missing data.
Missing values appear as:
NaN (Not a Number), used in numeric columns.
None, used in object (e.g. text) columns.

* df.isnull()  - returns a Boolean DataFrame of the same shape: True where a value is missing, False where it is present.
* df.isnull().sum()  - counts how many values are missing in each column.
* df.dropna()      - drops rows or columns that contain missing values.  axis=0 drops rows   /  axis=1 drops columns
* df.fillna(value, inplace=True)         - fills in missing values with the given value.
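
The following sketch runs the listed methods on a hypothetical three-row frame (illustrative values only). It also shows one option not used elsewhere in this notebook: passing a dict to `fillna` so each column gets its own fill value.

```python
import pandas as pd

# Hypothetical frame with one fully missing row, for illustration only.
df = pd.DataFrame({
    "Name": ["A", None, "C"],
    "Age": [20.0, None, 40.0],
})

missing_per_column = df.isnull().sum()   # number of missing values in each column
rows_kept = df.dropna(axis=0)            # drop every row that has a missing value
filled = df.fillna({"Name": "Unknown", "Age": df["Age"].mean()})  # per-column fill values

print(missing_per_column)
print(rows_kept)
print(filled)
```

None of these calls uses `inplace=True`, so `df` itself still contains the missing values afterwards.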

In [None]:
#detecting missing data    #.isnull()
data = {
    "Name":["Ram",None,"Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,None,37,29,23,21,45,61],
    "Salary":[50000,None,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,None,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
print(perform)

print("\nMissing Datas\n")         #True means  Missing...
print(perform.isnull())            #False means Present...

     Name   Age    Salary  Performance Score
0     Ram  24.0   50000.0               85.0
1    None   NaN       NaN                NaN
2   Gattu  37.0   70000.0               90.0
3   Arnob  29.0  100000.0               75.0
4   Aditi  23.0   45000.0               81.0
5    Adil  21.0   55000.0               78.0
6     Raj  45.0   85000.0               95.0
7  Simran  61.0   94000.0               93.0

Missing Datas

    Name    Age  Salary  Performance Score
0  False  False   False              False
1   True   True    True               True
2  False  False   False              False
3  False  False   False              False
4  False  False   False              False
5  False  False   False              False
6  False  False   False              False
7  False  False   False              False


In [28]:
# detecting  how many values are missing.          .isnull().sum() 
print(perform.isnull().sum())

Name                 1
Age                  1
Salary               1
Performance Score    1
dtype: int64


In [32]:
#How to drop rows with missing values.     rows = axis 0
data = {
    "Name":["Ram",None,"Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,None,37,29,23,21,45,61],
    "Salary":[50000,None,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,None,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
print(perform)

print("\nMissing Datas\n")         
print(perform.isnull())  

# print("\nDropping rows\n")

perform.dropna(axis=0, inplace=True)
print(perform)


     Name   Age    Salary  Performance Score
0     Ram  24.0   50000.0               85.0
1    None   NaN       NaN                NaN
2   Gattu  37.0   70000.0               90.0
3   Arnob  29.0  100000.0               75.0
4   Aditi  23.0   45000.0               81.0
5    Adil  21.0   55000.0               78.0
6     Raj  45.0   85000.0               95.0
7  Simran  61.0   94000.0               93.0

Missing Datas

    Name    Age  Salary  Performance Score
0  False  False   False              False
1   True   True    True               True
2  False  False   False              False
3  False  False   False              False
4  False  False   False              False
5  False  False   False              False
6  False  False   False              False
7  False  False   False              False
     Name   Age    Salary  Performance Score
0     Ram  24.0   50000.0               85.0
2   Gattu  37.0   70000.0               90.0
3   Arnob  29.0  100000.0               75.0
4   Aditi  2

In [31]:
#How to drop  columns with missing values.      cols = axis 1
perform.dropna(axis=1, inplace=True)

In [None]:
#fillna(value, inplace=True)
#replaces every missing value with one default value
data = {
    "Name":["Ram",None,"Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,None,37,29,23,21,45,61],
    "Salary":[50000,None,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,None,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
print(perform)

print("\n Filling Data\n")
perform.fillna("a",inplace=True)
print(perform)

     Name   Age    Salary  Performance Score
0     Ram  24.0   50000.0               85.0
1    None   NaN       NaN                NaN
2   Gattu  37.0   70000.0               90.0
3   Arnob  29.0  100000.0               75.0
4   Aditi  23.0   45000.0               81.0
5    Adil  21.0   55000.0               78.0
6     Raj  45.0   85000.0               95.0
7  Simran  61.0   94000.0               93.0

 Filling Data

     Name   Age    Salary Performance Score
0     Ram  24.0   50000.0              85.0
1       a     a         a                 a
2   Gattu  37.0   70000.0              90.0
3   Arnob  29.0  100000.0              75.0
4   Aditi  23.0   45000.0              81.0
5    Adil  21.0   55000.0              78.0
6     Raj  45.0   85000.0              95.0
7  Simran  61.0   94000.0              93.0



In [None]:
#fill calculated value
data = {
    "Name":["Ram",None,"Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,None,37,29,23,21,45,61],
    "Salary":[50000,None,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,None,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
print(perform)


#calculated fill: assign the result back instead of calling fillna(inplace=True) on a selected column
perform["Age"] = perform["Age"].fillna(perform["Age"].mean())   #Average
perform["Salary"] = perform["Salary"].fillna(perform["Salary"].mean())  #Average
print(perform)

     Name   Age    Salary  Performance Score
0     Ram  24.0   50000.0               85.0
1    None   NaN       NaN                NaN
2   Gattu  37.0   70000.0               90.0
3   Arnob  29.0  100000.0               75.0
4   Aditi  23.0   45000.0               81.0
5    Adil  21.0   55000.0               78.0
6     Raj  45.0   85000.0               95.0
7  Simran  61.0   94000.0               93.0
     Name        Age         Salary  Performance Score
0     Ram  24.000000   50000.000000               85.0
1    None  34.285714   71285.714286                NaN
2   Gattu  37.000000   70000.000000               90.0
3   Arnob  29.000000  100000.000000               75.0
4   Aditi  23.000000   45000.000000               81.0
5    Adil  21.000000   55000.000000               78.0
6     Raj  45.000000   85000.000000               95.0
7  Simran  61.000000   94000.000000               93.0


Note: calling fillna(..., inplace=True) on a selected column, e.g. perform["Age"].fillna(perform["Age"].mean(), inplace=True), makes pandas emit this FutureWarning:

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


# Interpolation .
It is a technique for filling missing values with estimates computed from the neighbouring values.
e.g. [10,20,NaN,40,50]    NaN = 30, estimated as the midpoint between 20 and 40....

# Types of interpolation.
linear method.
polynomial method.
time method.

Helps in
1. Preserving data integrity.
2. Smoothing trends.
3. Avoiding data loss.

interpolate()   is the method that fills the missing values of a DataFrame with estimated values.

df.interpolate(method="linear",axis=0, inplace=True)

* When to use interpolate.
1. Time series data.
2. Numeric data.
3. When you want to avoid dropping rows.
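
As a minimal sketch, the `[10, 20, NaN, 40, 50]` example above can be reproduced on a Series. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The example series from the text, with one missing value in the middle.
s = pd.Series([10, 20, np.nan, 40, 50])

# Linear interpolation fills the gap on a straight line between its neighbours (20 and 40).
filled = s.interpolate(method="linear")

print(filled.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0]
```

The values come back as floats because a column containing NaN is stored as a floating-point column.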


In [3]:
#Linear interpolation method.
data ={
    "Time":[1,2,3,4,5],
    "Values" : [10,None,30,None,50]
}

df = pd.DataFrame(data)
print("Before Interpolation")
print(df)

df["Values"] = df["Values"].interpolate(method="linear")
print("\nAfter Interpolation")
print(df)



Before Interpolation
   Time  Values
0     1    10.0
1     2     NaN
2     3    30.0
3     4     NaN
4     5    50.0

After Interpolation
   Time  Values
0     1    10.0
1     2    20.0
2     3    30.0
3     4    40.0
4     5    50.0


In [51]:
data = {
    "Name":["Ram","Sam","Gattu","Arnob","Aditi","Adil","Raj","Simran"],
    "Age" :[24,None,37,29,23,21,45,61],
    "Salary":[50000,None,70000,100000,45000,55000,85000,94000],
    "Performance Score":[85,None,90,75,81,78,95,93]
}

perform=pd.DataFrame(data)
print(perform)

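#interpolate only the numeric columns; "Name" holds text and cannot be interpolated
num_cols = perform.select_dtypes(include="number").columns
perform[num_cols] = perform[num_cols].interpolate(method="linear",axis=0)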
print("\nAfter Interpolation")
print(perform)

     Name   Age    Salary  Performance Score
0     Ram  24.0   50000.0               85.0
1     Sam   NaN       NaN                NaN
2   Gattu  37.0   70000.0               90.0
3   Arnob  29.0  100000.0               75.0
4   Aditi  23.0   45000.0               81.0
5    Adil  21.0   55000.0               78.0
6     Raj  45.0   85000.0               95.0
7  Simran  61.0   94000.0               93.0

After Interpolation
     Name   Age    Salary  Performance Score
0     Ram  24.0   50000.0               85.0
1     Sam  30.5   60000.0               87.5
2   Gattu  37.0   70000.0               90.0
3   Arnob  29.0  100000.0               75.0
4   Aditi  23.0   45000.0               81.0
5    Adil  21.0   55000.0               78.0
6     Raj  45.0   85000.0               95.0
7  Simran  61.0   94000.0               93.0



# Sorting & Aggregation.
* Sorting data by one column   - df.sort_values(by="ColumnName", ascending=True/False, inplace=True)
* For ascending order: ascending=True    ,   for descending order: ascending=False

In [19]:
#Sorting data by one column...    df.sort_values(by="ColumnName", ascending=True/False, inplace=True)

data ={
    "Name":["Arun","Karun","Varun","Pandey"],
    "Age" :[12,14,17,11],
    "Salary":[10000,50000,45000,34000]
}

df=pd.DataFrame(data)
print(df)

print("\nSorting Data of Age \n")
df.sort_values(by="Age",ascending=True,inplace=True)
print(df)

     Name  Age  Salary
0    Arun   12   10000
1   Karun   14   50000
2   Varun   17   45000
3  Pandey   11   34000

Sorting Data of Age 

     Name  Age  Salary
3  Pandey   11   34000
0    Arun   12   10000
1   Karun   14   50000
2   Varun   17   45000


In [None]:
#Sorting by multiple columns: Age ascending, then Salary descending to break ties....
data ={
    "Name":["Arun","Karun","Varun","Pandey"],
    "Age" :[12,14,17,11],
    "Salary":[10000,50000,45000,34000]
}

df=pd.DataFrame(data)
print(df)

print("\nSorting Multiple Data of Age ,Salary \n")
df.sort_values(by=["Age","Salary"],ascending=[True,False],inplace=True)
print(df)

     Name  Age  Salary
0    Arun   12   10000
1   Karun   14   50000
2   Varun   17   45000
3  Pandey   11   34000

Sorting Multiple Data of Age ,Salary 

     Name  Age  Salary
3  Pandey   11   34000
0    Arun   12   10000
1   Karun   14   50000
2   Varun   17   45000
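
In the example above every Age is distinct, so the Salary key never changes the order. The sketch below uses hypothetical data with repeated ages to show the second key breaking ties. Names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with tied ages, for illustration only.
df = pd.DataFrame({
    "Name": ["P", "Q", "R", "S"],
    "Age": [30, 25, 30, 25],
    "Salary": [40000, 52000, 61000, 47000],
})

# Age ascending first; within the same Age, Salary descending.
ordered = df.sort_values(by=["Age", "Salary"], ascending=[True, False])

print(ordered)
```

Within each age group the higher salary now comes first, while the groups themselves stay in ascending age order.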


# Summary Statistics 
* Summary statistics are numerical summaries of a column, such as the average, total, minimum, maximum, etc.      Syntax: df["Column Name"].mean()  /   df["Column Name"].sum()
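
Several statistics can also be requested in a single call with `agg`. A minimal sketch, assuming a frame that holds only the Salary values used in this section:

```python
import pandas as pd

df = pd.DataFrame({"Salary": [10000, 50000, 45000, 34000]})

# agg with a list of function names returns one value per name, as a Series.
summary = df["Salary"].agg(["mean", "sum", "min", "max", "median"])

print(summary)
```

The result is indexed by the function names, so a single value can be read back with, for example, `summary["median"]`.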

In [None]:
#Summary Statistics.......
data ={
    "Name":["Arun","Karun","Varun","Pandey"],
    "Age" :[12,14,17,11],
    "Salary":[10000,50000,45000,34000]
}

df=pd.DataFrame(data)
print(df)

avg_salary = df["Salary"].mean()
print(avg_salary)

sum_salary = df["Salary"].sum()
print(sum_salary)

min_salary = df["Salary"].min()
print(min_salary)

max_salary = df["Salary"].max()
print(max_salary)

median_salary = df["Salary"].median()
print(median_salary)

     Name  Age  Salary
0    Arun   12   10000
1   Karun   14   50000
2   Varun   17   45000
3  Pandey   11   34000
34750.0
139000
10000
50000
39500.0


# What is Grouping in Pandas?
* Grouping means splitting the data into smaller groups that share the same value in one or more columns, then aggregating each group.  e.g. df.groupby("Age")["Salary"].sum()
# Common Aggregation Functions...
1. sum()
2. mean()
3. count()
4. min()
5. max()
6. std()
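
The listed functions can be applied to each group in one call by passing their names to `agg`. A sketch using the Age and Salary values from the grouping example in this section:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [28, 45, 27, 27, 45],
    "Salary": [40000, 50000, 45000, 64000, 75000],
})

# One row per Age group, one column per aggregation function.
stats = df.groupby("Age")["Salary"].agg(["sum", "mean", "count", "min", "max", "std"])

print(stats)
```

A group with a single row, such as Age 28 here, has no sample standard deviation, so its `std` value is NaN.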

In [None]:
#Grouping...   Single column...
data ={
    "Name":["Arun","Karun","Varun","Pandey","Amit"],
    "Age" :[28,45,27,27,45],
    "Salary":[40000,50000,45000,64000,75000]
}

df=pd.DataFrame(data)
print(df)

print("\nGrouped Age & Salary")
grouped = df.groupby("Age")["Salary"].sum()
print(grouped)

     Name  Age  Salary
0    Arun   28   40000
1   Karun   45   50000
2   Varun   27   45000
3  Pandey   27   64000
4    Amit   45   75000

Grouped Age & Salary
Age
27    109000
28     40000
45    125000
Name: Salary, dtype: int64


In [None]:
#Grouping Multiple Columns.....
data ={
    "Name":["Arun","Karun","Varun","Pandey","Amit"],
    "Age" :[28,45,27,27,45],
    "Salary":[40000,50000,45000,64000,75000]
}

df=pd.DataFrame(data)
print(df)

print("\nGrouped Age & Salary")
grouped = df.groupby(["Age","Name"])["Salary"].sum()
print(grouped)

     Name  Age  Salary
0    Arun   28   40000
1   Karun   45   50000
2   Varun   27   45000
3  Pandey   27   64000
4    Amit   45   75000

Grouped Age & Salary
Age  Name  
27   Pandey    64000
     Varun     45000
28   Arun      40000
45   Amit      75000
     Karun     50000
Name: Salary, dtype: int64


# Merging & Joining....
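* Merging combines the rows of two DataFrames based on a common key column.    pd.merge(df1, df2, on="Column_Name", how="type of join")
# Types of join/merge.
- Inner: keeps only the keys present in both DataFrames.
- Outer: keeps all keys from both DataFrames.
- Left: keeps all keys from the left DataFrame.
- Right: keeps all keys from the right DataFrame.
- Cross: pairs every row of the left DataFrame with every row of the right one.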

In [41]:
#Merging & joining ....


#customers dataframe

df_customers = pd.DataFrame({
    "CustomerID" : [1,2,3,4,5],
    "Name":["Raj","Ram","Rohit","Dhruv","Ronit"]
})

#order dataframe

df_orders = pd.DataFrame({
    "CustomerID":[1,2,6,4,7],
    "OrderAmount":[250,450,500,200,600]
})

#merge both dataframe....

#inner join
df_merged =pd.merge(df_customers, df_orders, on="CustomerID", how="inner")
print("inner join")
print(df_merged)

#outer join
df_merged =pd.merge(df_customers, df_orders, on="CustomerID", how="outer")
print("outer join")
print(df_merged)

#left join
df_merged =pd.merge(df_customers, df_orders, on="CustomerID", how="left")
print("left join")
print(df_merged)

#right join
df_merged =pd.merge(df_customers, df_orders, on="CustomerID", how="right")
print("right join")
print(df_merged)


inner join
   CustomerID   Name  OrderAmount
0           1    Raj          250
1           2    Ram          450
2           4  Dhruv          200
outer join
   CustomerID   Name  OrderAmount
0           1    Raj        250.0
1           2    Ram        450.0
2           3  Rohit          NaN
3           4  Dhruv        200.0
4           5  Ronit          NaN
5           6    NaN        500.0
6           7    NaN        600.0
left join
   CustomerID   Name  OrderAmount
0           1    Raj        250.0
1           2    Ram        450.0
2           3  Rohit          NaN
3           4  Dhruv        200.0
4           5  Ronit          NaN
right join
   CustomerID   Name  OrderAmount
0           1    Raj          250
1           2    Ram          450
2           6    NaN          500
3           4  Dhruv          200
4           7    NaN          600


In [None]:
#cross join: pairs every row of the first DataFrame with every row of the second

# df1 has m rows

# df2 has n rows

# the result has m * n rows
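
A minimal sketch of a cross join, assuming two small hypothetical frames (sizes and colors are illustrative, not part of the customer data above). A cross join takes no `on` key, since every left row is paired with every right row.

```python
import pandas as pd

# Hypothetical frames for illustration only.
sizes = pd.DataFrame({"Size": ["S", "M", "L"]})      # m = 3 rows
colors = pd.DataFrame({"Color": ["Red", "Blue"]})    # n = 2 rows

# how="cross" builds every (Size, Color) combination: 3 * 2 = 6 rows.
combos = pd.merge(sizes, colors, how="cross")

print(combos)
print(len(combos))
```

Because the row count is the product of the two inputs, cross joins grow quickly and are best kept to small frames.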

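# Concatenating DataFrames...
Combine data vertically or horizontally 
* Vertically (row-wise), axis=0
* Horizontally (column-wise), axis=1

* Syntax

pd.concat([df1,df2],axis=0, ignore_index=True)

[df1, df2] = the list of DataFrames to combine
axis = 0 stacks the rows, axis = 1 places the frames side by side
ignore_index = True discards the original labels along the concatenation axis and renumbers them 0, 1, 2, ...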

In [42]:
#Vertically (row-wise)

#region1
df_Region1 = pd.DataFrame({
    "CustomerID":[1,2],
    "Name" : ["Gopal","Raju"]
})

#region2
df_Region2 = pd.DataFrame({
    "CustomerID":[3,4],
    "Name" : ["Leela","Shruti"]
})

#concatenate vertically....

df_concat = pd.concat([df_Region1,df_Region2], axis = 0, ignore_index=True)
print(df_concat)


   CustomerID    Name
0           1   Gopal
1           2    Raju
2           3   Leela
3           4  Shruti


In [43]:
#Horizontally (column-wise)

#region1
df_Region1 = pd.DataFrame({
    "CustomerID":[1,2],
    "Name" : ["Gopal","Raju"]
})

#region2
df_Region2 = pd.DataFrame({
    "CustomerID":[3,4],
    "Name" : ["Leela","Shruti"]
})

#concatenate horizontally; with axis=1, ignore_index=True replaces the column names with 0, 1, 2, 3.....

df_concat = pd.concat([df_Region1,df_Region2], axis = 1, ignore_index=True)
print(df_concat)

   0      1  2       3
0  1  Gopal  3   Leela
1  2   Raju  4  Shruti
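
The output above loses the column names because `ignore_index=True` renumbers the labels along the concatenation axis. One way to keep them, sketched here under the assumption that a two-level column header is acceptable, is to drop `ignore_index` and pass `keys` instead. The labels "Region1" and "Region2" are illustrative.

```python
import pandas as pd

region1 = pd.DataFrame({"CustomerID": [1, 2], "Name": ["Gopal", "Raju"]})
region2 = pd.DataFrame({"CustomerID": [3, 4], "Name": ["Leela", "Shruti"]})

# keys adds an outer column level, so the repeated names stay distinguishable.
side_by_side = pd.concat([region1, region2], axis=1, keys=["Region1", "Region2"])

print(side_by_side)
print(side_by_side[("Region2", "Name")].tolist())
```

Each original column is then addressed by a pair such as `("Region2", "Name")`.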
