# Pandas is an important Python library for data manipulation
# and analysis. It provides an intuitive, easy-to-use set of tools
# for working with tabular and other labeled data.

Data Structures of Pandas
-------------------------
All data in pandas is represented using two primary data structures:
• Series
• DataFrames

Series
A Series in pandas is a one-dimensional ndarray with axis labels (an index). Functionally, it behaves much like a simple array whose elements can also be looked up by label.
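As a minimal sketch of that description, the snippet below builds a Series from a NumPy array and reads one element by position and another by label. The values and the labels 'a' to 'd' are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels, used only to illustrate the idea.
values = np.array([0.25, 0.5, 0.75, 1.0])
s = pd.Series(values, index=['a', 'b', 'c', 'd'])

# Positional access works as it does on a plain array...
first_by_position = s.iloc[0]
# ...while the index labels allow dictionary-style lookup.
value_at_b = s['b']

print(s.index.tolist())
print(first_by_position, value_at_b)
```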


Dataframe
1> The DataFrame is the most important and useful data structure; it is used for almost all kinds of data representation and manipulation in pandas. Unlike NumPy arrays (in general), a DataFrame can contain heterogeneous data.
2> Tabular data is typically represented using DataFrames, which are analogous to an Excel sheet or a SQL table. This is extremely useful for representing raw datasets as well as processed feature sets in machine learning and data science.
3> Operations can be performed along either axis of a DataFrame: rows or columns.
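The three points above can be sketched with a small table. The frame below is a hypothetical example (the column names and numbers are assumptions, not data from this notebook); it mixes a text column with numeric ones and then aggregates along each axis.

```python
import pandas as pd

# Hypothetical toy table: one text column and two numeric columns.
scores = pd.DataFrame({
    'student': ['abc', 'xyz', 'pqr'],
    'maths': [20, 30, 40],
    'physics': [25, 35, 45],
})

# Each column keeps its own dtype, so the frame is heterogeneous.
print(scores.dtypes)

numeric = scores[['maths', 'physics']]
column_means = numeric.mean(axis=0)  # axis=0: one value per column
row_totals = numeric.sum(axis=1)     # axis=1: one value per row

print(column_means)
print(row_totals)
```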

In [1]:
import pandas
pandas.__version__

'1.5.3'

In [2]:
import pandas as pd

In [3]:
# Example of series data type 

population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}

population = pd.Series(population_dict)
print(population)

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


In [4]:
# By default, a Series built from a dictionary takes its index
# from the dictionary keys, in insertion order.

# Unlike a dictionary, though, 
# the Series also supports array-style operations such as slicing:

population['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64
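One detail of the slice above is easy to miss: slicing by label includes the end label, whereas slicing by position excludes the end position. A short sketch using the same population figures:

```python
import pandas as pd

population = pd.Series({
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135,
})

# Label-based slice: the end label 'Florida' is included.
by_label = population['California':'Florida']

# Position-based slice: the end position 3 is excluded.
by_position = population.iloc[0:3]

print(len(by_label), len(by_position))
```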

Constructing Dataframes
-----------------------
The next fundamental structure in Pandas is the DataFrame. 
Like the Series object, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

A Pandas DataFrame can be constructed in a variety of ways. 
Here we’ll give several examples.

> From a single Series object. 
A DataFrame is a collection of Series objects, and a single column
DataFrame can be constructed from a single Series:

In [5]:
pd.DataFrame(population, columns=['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


In [6]:
age = [21, 22, 23]
name = ['a', 'b', 'c']
df = pd.DataFrame({'Name': name, 'Age': age})
df

Unnamed: 0,Name,Age
0,a,21
1,b,22
2,c,23


In [7]:
# From a list of dicts. Any list of dictionaries can be made into a DataFrame. 
# We’ll use a simple list comprehension to create some data:

data = [{'a': i, 'b': 2 * i} for i in range(35)]
df=pd.DataFrame(data)
df

# Note: the column names are derived from the dictionary keys.
# Exercise: how can the column names of a DataFrame be changed?
# Try assigning a new list of labels:
# df.columns = ['A', 'B']
# df

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4
3,3,6
4,4,8
5,5,10
6,6,12
7,7,14
8,8,16
9,9,18
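For the column-renaming exercise in the cell above, here is one possible sketch. It shows two approaches: `rename`, which returns a new DataFrame, and assignment to `df.columns`, which relabels the existing one. The new names are arbitrary placeholders.

```python
import pandas as pd

data = [{'a': i, 'b': 2 * i} for i in range(3)]
df = pd.DataFrame(data)

# rename() maps old labels to new ones and returns a new DataFrame;
# df itself still has the columns 'a' and 'b' afterwards.
renamed = df.rename(columns={'a': 'A', 'b': 'B'})

# Assigning to df.columns replaces every label in place, so the list
# must have exactly one entry per column.
df.columns = ['first', 'second']

print(renamed.columns.tolist())
print(df.columns.tolist())
```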


In [8]:
# Even if some keys in the dictionary are missing, 
# Pandas will fill them in with NaN (i.e., “not a number”) values:

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


In [9]:
import numpy as np
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.838795,0.791263
b,0.944542,0.700879
c,0.914281,0.761549


Data Retrieval
--------------
Pandas provides numerous ways to retrieve and read in data. We can convert data from CSV files, databases, flat files, and so on into dataframes. We can also convert a list of dictionaries (Python dict) into a dataframe.

We will cover three of the most important data sources:
• List of dictionaries
• CSV files
• Databases
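The CSV and database routes can be sketched without any external files. The example below is only an illustration under stated assumptions: the CSV text is held in memory with `io.StringIO`, the database is an in-memory SQLite database, and the table name `records` is hypothetical.

```python
import io
import sqlite3

import pandas as pd

# CSV route: read_csv accepts a path or any file-like object.
csv_text = "Subject,grade,Student,Marks\na,1,abc,20\nb,2,xyz,30\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Database route: write the frame to an in-memory SQLite table
# (hypothetical name 'records'), then query it back with SQL.
conn = sqlite3.connect(':memory:')
from_csv.to_sql('records', conn, index=False)
from_db = pd.read_sql(
    'SELECT Student, Marks FROM records WHERE grade = 1', conn
)
conn.close()

print(from_csv.shape)
print(from_db)
```

With a real database, the same `pd.read_sql` call would take a connection to that database instead of the in-memory one.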

In [10]:
# List of Dictionaries to Dataframe

import pandas as pd
d = [{'city': 'Delhi', 'data': 1000},
     {'city': 'Bangalore', 'data': 2000},
     {'city': 'Mumbai', 'data': 1000}]
pd.DataFrame(d)

# Two important things to note here:
# first, the dictionary keys are picked up as the column names
# of the DataFrame;
# second, the rows receive the default integer index (0, 1, 2, ...).

Unnamed: 0,city,data
0,Delhi,1000
1,Bangalore,2000
2,Mumbai,1000


In [11]:
import pandas as pd

In [12]:
# CSV file to DataFrame -> read the test.csv file

data = pd.read_csv('C:\\Users\\admin\\Desktop\\test.csv')
data.head()

Unnamed: 0,Subject,grade,Student,Marks
0,a,1,abc,20
1,b,2,xyz,30
2,c,1,pqr,40
3,d,2,mno,10
4,e,1,def,45


In [13]:
data.tail()

Unnamed: 0,Subject,grade,Student,Marks
1,b,2,xyz,30
2,c,1,pqr,40
3,d,2,mno,10
4,e,1,def,45
5,f,2,xcv,50


In [14]:
data.describe()

Unnamed: 0,grade,Marks
count,6.0,6.0
mean,1.5,32.5
std,0.547723,15.411035
min,1.0,10.0
25%,1.0,22.5
50%,1.5,35.0
75%,2.0,43.75
max,2.0,50.0


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Subject  6 non-null      object
 1   grade    6 non-null      int64 
 2   Student  6 non-null      object
 3   Marks    6 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes


In [16]:
data.shape

(6, 4)

In [17]:
data

Unnamed: 0,Subject,grade,Student,Marks
0,a,1,abc,20
1,b,2,xyz,30
2,c,1,pqr,40
3,d,2,mno,10
4,e,1,def,45
5,f,2,xcv,50


In [18]:
data['Marks'] =data['Marks']+5


In [19]:
data["Grade_marks"] = data["grade"] * data['Marks']

In [20]:
data

Unnamed: 0,Subject,grade,Student,Marks,Grade_marks
0,a,1,abc,25,25
1,b,2,xyz,35,70
2,c,1,pqr,45,45
3,d,2,mno,15,30
4,e,1,def,50,50
5,f,2,xcv,55,110


In [21]:
df1 = data[data["grade"]==1]
df1

Unnamed: 0,Subject,grade,Student,Marks,Grade_marks
0,a,1,abc,25,25
2,c,1,pqr,45,45
4,e,1,def,50,50


In [22]:
df1["Marks"]=df1['Marks']+2
df1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1["Marks"]=df1['Marks']+2


Unnamed: 0,Subject,grade,Student,Marks,Grade_marks
0,a,1,abc,27,25
2,c,1,pqr,47,45
4,e,1,def,52,50
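The warning above appears because df1 was produced by filtering data, so pandas cannot tell whether the assignment is meant to change df1 alone or data as well. A minimal sketch of the two unambiguous alternatives, using a small hypothetical frame in place of the CSV data:

```python
import pandas as pd

# Hypothetical stand-in for the CSV data used in this notebook.
data = pd.DataFrame({'grade': [1, 2, 1], 'Marks': [25, 35, 45]})

# Option 1: take an explicit copy, then modify the copy.
# `data` is left untouched.
df1 = data[data['grade'] == 1].copy()
df1['Marks'] = df1['Marks'] + 2

# Option 2: modify the original rows directly through .loc.
data.loc[data['grade'] == 2, 'Marks'] += 3

print(df1)
print(data)
```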


In [23]:
df2 = data[data['grade']==2]
df2

Unnamed: 0,Subject,grade,Student,Marks,Grade_marks
1,b,2,xyz,35,70
3,d,2,mno,15,30
5,f,2,xcv,55,110


In [24]:
df2['Marks']=df2["Marks"]+3

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['Marks']=df2["Marks"]+3


In [25]:
df2

Unnamed: 0,Subject,grade,Student,Marks,Grade_marks
1,b,2,xyz,38,70
3,d,2,mno,18,30
5,f,2,xcv,58,110


In [27]:
df3=pd.concat((df1,df2),axis=0)
df3

Unnamed: 0,Subject,grade,Student,Marks,Grade_marks
0,a,1,abc,27,25
2,c,1,pqr,47,45
4,e,1,def,52,50
1,b,2,xyz,38,70
3,d,2,mno,18,30
5,f,2,xcv,58,110


In [27]:
data.max()

Subject      e
grade        2
Student    xyz
Marks       45
dtype: object

In [None]:
df2 = data[]

In [28]:
data['Marks'].max()

45

In [30]:
data['Marks'].mean()

29.0

In [28]:
# argmax() returns the integer position of the maximum value in the Series.
# If the maximum occurs more than once, it returns the position of the
# first occurrence.
data['Marks'].argmax()



5

In [None]:
data['Season'].value_counts()

# Accessing Values
Then, to get the attributes of a particular game (row), we use the **iloc[]** indexer. iloc is one of the most important indexers: use it whenever you know the integer position of the row you want to access. As per the pandas documentation, iloc is "integer-location based indexing for selection by position."
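Position and label coincide only when a DataFrame has the default 0, 1, 2, ... index. The sketch below uses a hypothetical frame with a non-default index to show the pairing: `argmax()` gives a position for `iloc`, while `idxmax()` gives a label for `loc`.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
marks = pd.DataFrame(
    {'Student': ['abc', 'xyz', 'pqr'], 'Marks': [20, 45, 30]},
    index=[10, 20, 30],
)

pos = marks['Marks'].argmax()    # integer position of the maximum
label = marks['Marks'].idxmax()  # index label of the maximum

row_by_pos = marks.iloc[pos]     # iloc expects a position
row_by_label = marks.loc[label]  # loc expects a label

print(pos, label)
print(row_by_pos['Student'])
```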

In [33]:
data.iloc[[data['Marks'].argmax()]]

Unnamed: 0,Subject,grade,Student,Marks
4,e,1,def,45


In [34]:
data.iloc[[data['Wscore'].argmax()]]['Lscore']

KeyError: 'Wscore'

In [None]:
type(data.iloc[[data['Wscore'].argmax()]]['Lscore'])

In [None]:
type(data.iloc[[data['Wscore'].argmax()]])

In [None]:
data.iloc[:3]

In [None]:
data.loc[:3]
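The two cells above look alike but return different numbers of rows on a default index: `iloc[:3]` stops before position 3, while `loc[:3]` includes the label 3. A short sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4.
frame = pd.DataFrame({'Marks': [20, 30, 40, 10, 45]})

first_three = frame.iloc[:3]   # positions 0, 1, 2
up_to_label_3 = frame.loc[:3]  # labels 0, 1, 2, 3 (end label included)

print(len(first_three), len(up_to_label_3))
```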

In [None]:
data.loc[data['Wscore'].argmax(), 'Lscore']

In [None]:
data.at[data['Wscore'].argmax(), 'Lscore']

# Sorting

In [None]:
data.sort_values('Lscore').tail()
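`sort_values` sorts in ascending order by default, so `.tail()` on the result shows the largest values. A sketch with hypothetical scores (the column names follow the cell above, the numbers are invented):

```python
import pandas as pd

# Hypothetical scores; only the column names come from the notebook.
games = pd.DataFrame({
    'Wscore': [81, 77, 63, 70],
    'Lscore': [62, 71, 56, 63],
})

# Ascending sort: the smallest Lscore values come first.
lowest_two = games.sort_values('Lscore').head(2)

# Descending sort: the largest Lscore values come first.
highest_two = games.sort_values('Lscore', ascending=False).head(2)

print(lowest_two)
print(highest_two)
```

The original row labels travel with the rows, which is why the index of a sorted frame is no longer in order.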

# Filtering Rows Conditionally

In [None]:
data[data['Wscore'] > 150]

In [35]:
data[(data['Wscore'] > 150) & (data['Lscore'] < 100)]

KeyError: 'Wscore'
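Conditional filtering works by building a boolean Series and passing it as the row selector. The sketch below uses hypothetical scores; note that each comparison needs its own parentheses when combined with `&`, because `&` binds more tightly than `>` and `<`.

```python
import pandas as pd

# Hypothetical scores; only the column names come from the notebook.
games = pd.DataFrame({
    'Wscore': [160, 155, 120],
    'Lscore': [90, 110, 80],
})

# A comparison yields one True/False value per row.
high_scoring = games['Wscore'] > 150

# Combine masks element-wise with & (and), | (or), ~ (not).
one_sided = games[high_scoring & (games['Lscore'] < 100)]

print(high_scoring.tolist())
print(one_sided)
```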

# Grouping

In [None]:
data.groupby('Wteam')['Wscore'].mean().head(50)

In [None]:
data.groupby('Wteam')['Wloc'].value_counts().head(9)
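As a rough illustration of the two grouping cells above, the sketch below runs the same calls on a small hypothetical frame. The column names follow the notebook; the team ids, scores, and location codes are invented.

```python
import pandas as pd

# Hypothetical games; only the column names come from the notebook.
games = pd.DataFrame({
    'Wteam': [1, 1, 2, 2, 2],
    'Wscore': [80, 90, 70, 75, 65],
    'Wloc': ['H', 'A', 'H', 'H', 'N'],
})

# Mean winning score per team: one row per distinct Wteam value.
mean_score = games.groupby('Wteam')['Wscore'].mean()

# Count of each location per team: indexed by (Wteam, Wloc).
loc_counts = games.groupby('Wteam')['Wloc'].value_counts()

print(mean_score)
print(loc_counts)
```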

In [None]:
data.values

In [None]:
data.values[0][0]

# Extracting Rows and Columns

In [None]:
data[['Wscore', 'Lscore']].head(50)

In [None]:
data[0:3]

In [None]:
data.iloc[0:3]

# Data Cleaning

In [None]:
data.isnull().sum()

In [None]:
df.isnull() 

In [None]:
df.notnull()

In [None]:
df.fillna(0)


In [None]:
data.replace(to_replace = np.nan, value = -99) 

In [None]:
df.dropna() 

In [None]:
data.dropna() 
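The cleaning calls above all return new objects and leave the original DataFrame unchanged unless the result is assigned back. A minimal sketch on a hypothetical frame with two missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
raw = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, 5.0, 6.0],
})

missing_per_column = raw.isnull().sum()  # NaN count per column
filled = raw.fillna(0)                   # NaN replaced by 0
complete_rows = raw.dropna()             # rows containing NaN removed

print(missing_per_column)
print(filled)
print(complete_rows)
```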

# IPL Datasets

In [None]:
# CSV file to DataFrame -> read the matches.csv file

df = pd.read_csv('matches.csv')
df.head()

In [None]:
df.columns.to_list()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df['win_by_runs'].max()

In [None]:
df['win_by_wickets'].mean()

In [36]:
df.iloc[df['win_by_runs'].argmax()]

KeyError: 'win_by_runs'

In [None]:
df.loc[df[df['win_by_wickets'].ge(1)].win_by_wickets.idxmin()]

### How many matches are in the dataset?

In [None]:
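# df.shape[0] counts the rows directly; df['id'].max() gives the same
# number only if the ids run from 1 to N without gaps.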

df['id'].max()

### How many seasons are in the dataset?

In [None]:
df['season'].unique()

In [None]:
len(df['season'].unique())

### Which team won by the maximum runs?

In [None]:
df.loc[df['win_by_runs'].idxmax()]

In [None]:
df.iloc[df['win_by_runs'].argmax()]

### Which team won by the maximum wickets?

In [None]:
df.iloc[df['win_by_wickets'].argmax()]

### Which team won by the minimum runs (closest margin)?

In [None]:
df.loc[df[df['win_by_runs'].ge(1)].win_by_runs.idxmin()]

In [None]:
df[df[df['win_by_runs'].ge(1)].win_by_runs.min() == df['win_by_runs']]['winner']  # list every winner tied for the smallest margin, not just the first

### Which team won by the minimum wickets?

In [None]:
df.loc[df[df['win_by_wickets'].ge(1)].win_by_wickets.idxmin()]

### Has winning the toss helped in winning the match?

In [None]:
ss = df['toss_winner'] == df['winner']

ss.groupby(ss).size()
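The comparison above yields one True/False value per match, and grouping that Series by itself counts each outcome. A sketch of the same idea on a hypothetical four-match frame, using `value_counts()` for the counts and `mean()` for the share of matches won by the toss winner:

```python
import pandas as pd

# Hypothetical matches; only the column names come from the notebook.
matches = pd.DataFrame({
    'toss_winner': ['A', 'B', 'A', 'C'],
    'winner': ['A', 'A', 'A', 'B'],
})

# True where the toss winner also won the match.
toss_helped = matches['toss_winner'] == matches['winner']

counts = toss_helped.value_counts()  # number of True and False rows
share = toss_helped.mean()           # True counts as 1, False as 0

print(counts)
print(share)
```

A share close to 0.5 would suggest the toss makes little difference; the real dataset has to be checked to see where it falls.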