In [None]:
# In Excel, it's very intuitive to perform various operations like adding, merging, and filtering data using built-in functions, as well as working with pivot tables. However, when you're working with Python, managing and manipulating data can seem more complex. This is where **Pandas** comes in.

# **Pandas** is a powerful Python library that provides a **DataFrame** structure, which is similar to an Excel table. It allows you to store and organize data in rows and columns, making it easy to:

# 1. **Store data** in a structured, tabular format (like Excel tables).
# 2. **Perform operations** such as adding, merging, and filtering data.
# 3. **Analyze and manipulate data** efficiently.
# 4. **Handle missing data**, group data, and create pivot tables.
# 5. **Read from and write to** different file formats like CSV, Excel, SQL databases, and more.

# So, just like you perform operations easily in Excel, **Pandas** allows you to perform similar operations on data efficiently in Python. It’s designed to simplify working with large datasets and performing complex data analysis.


In [None]:
# NOTE: Pandas is a library that provides the Series and DataFrame structures to store data the way Excel does, along with operations similar to Excel's, such as filtering, merging, sorting, and handling missing data.

In [1]:
#first thing
import pandas as pd #aliasing

In [2]:
#version
# Check which version of pandas is installed on your system

pd.__version__

'2.2.2'

In [None]:
# NOTE: Whatever operations you do in Excel (storing data, sorting, filtering, merging, and even handling missing values) you can also do in Python using Pandas.

In [None]:
#SERIES AND DATAFRAME

# Series and DataFrame are the pandas data structures used to store and organise data in a way similar to Excel

In [None]:
#SERIES

# A Pandas Series is a one-dimensional array data structure used to store and organize data in a format similar to a single column in Excel. Each element in the Series is associated with an index, allowing for efficient access and manipulation. A Series can hold various data types, including integers, floats, strings, and more, making it versatile for data analysis. It enables operations such as slicing, filtering, and applying functions efficiently.


In [3]:
#EXAMPLE:

# In Excel, if you want to create a column with values like 0.25, 0.5, 0.75, 1, you simply open Excel and type those values vertically in a column. This process is straightforward.

# Similarly, when you want to achieve the same in Python, you can use Pandas Series to easily create and organize this data in a single column format.

# Example: creating a Series from a list
data=[0.25,0.5,0.75,1]
res=pd.Series(data)  # note the capital S in Series
print(res)

#NOTE:
# Every value is associated with an index label.
# Excel numbers its rows from 1, while the default pandas index starts from 0.


0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


In [4]:
#.values:
# The .values attribute returns the data of a Series or DataFrame as a NumPy array

# For a Series (single-column data), .values gives a 1D array
# For a DataFrame, .values gives a 2D array

#EXAMPLE:
data=[0.25,0.5,0.75,1]
res=pd.Series(data)
print(res.values)

[0.25 0.5  0.75 1.  ]
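
A minimal sketch of the conversion described above, using small made-up values (the names `ser` and `frame` are illustrative, not from the notebook). It also calls `to_numpy()`, which the pandas documentation recommends over `.values` in newer code; for this data both give the same NumPy array.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.25, 0.5, 0.75, 1.0])
frame = pd.DataFrame({"population": [1000, 800], "area": [10, 8]})

ser_array = ser.to_numpy()      # Series -> 1D NumPy array
frame_array = frame.to_numpy()  # DataFrame -> 2D NumPy array

print(type(ser_array), ser_array.shape)      # shape is (4,)
print(type(frame_array), frame_array.shape)  # shape is (2, 2)
```

The shape is the quickest way to confirm which case you are in: one entry for a Series, two entries (rows, columns) for a DataFrame.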


In [6]:
#ACCESSING ELEMENTS (INDEXING)

# After data is organised into a Series or DataFrame, every value is associated with an index, so we can access elements through that index

#EXAMPLE:
data=[0.25,0.5,0.75,1]
res=pd.Series(data)
print(res)
#ACCESSING
res1=res[0]
print(res1)
print(res[1])

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
0.25
0.5


In [8]:
#MODIFYING INDEXES

# After data is organised into a Series or DataFrame, every value is associated with an index
# By default the index runs from 0 to n-1 (n = number of values)
# We can set the index to labels of our own choice

#EXAMPLE:
data=[0.25,0.5,0.75,1]
res=pd.Series(data,index=['a','b','c','d'])
print(res)
# Accessing elements: with a custom index, we access values through their index labels
res1=res['a']
print(res1)
print(res['b'])

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
0.25
0.5


In [None]:
#we can create a pandas series from
# A. lists
# B. dict
# C. np.array
# D. scalar values

In [11]:
#creating a pandas series from dictionary

# A dict stores key-value pairs
# When you create a pandas Series from a dict, the keys of the dict become the index and the values become the data of the Series
# So to access a value, we use its key as the index label

#EXAMPLE:
pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
res=pd.Series(pop)
print(res)

#SLICING: the keys become the index, so we can slice using the keys as labels
# Example: access the values from TS to KA (a label slice includes the end label)
res1=res['TS':'KA']
print(res1)

#ANOTHER WAY: we can also slice by integer position (0 to n-1)
res2=res[1:]
print(res2)

#QUESTION: if the keys have become the index, how can both keys and integers (0 to n-1) work?

# pandas has 2 types of indexing

# A. LABEL-BASED INDEXING: uses the index labels (the dict keys in this case) to access values

# B. INTEGER-BASED INDEXING: uses the position (integers) of the elements in the Series, regardless of what the index labels are

#NOTE: integer positions still work for slicing even though the keys have become the index, so either style can be used here. The explicit accessors .loc[] (labels) and .iloc[] (positions) make the intent clear.

AP    1000
TS     800
TN    6700
KA    3400
dtype: int64
TS     800
TN    6700
KA    3400
dtype: int64
TS     800
TN    6700
KA    3400
dtype: int64
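
The two indexing styles can be written explicitly with the `.loc[]` and `.iloc[]` accessors. This is a small sketch reusing the same population dict; the variable names are illustrative. Note the difference in slicing: a label slice includes its end label, while a position slice excludes its stop position.

```python
import pandas as pd

pop = pd.Series({"AP": 1000, "TS": 800, "TN": 6700, "KA": 3400})

by_label = pop.loc["TS"]          # label-based lookup
by_position = pop.iloc[1]         # position-based lookup of the same element

label_slice = pop.loc["TS":"KA"]  # TS, TN and KA: the end label is included
position_slice = pop.iloc[1:3]    # TS and TN only: position 3 is excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```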


In [None]:


#Comparison of NumPy and Pandas

# - **NumPy**:
#   - **Supports**: N-dimensional arrays (1D, 2D, etc.).
#   - **Operations**: Provides various mathematical operations for these arrays.
#   - **Data Type**: Each array is homogeneous (one data type per array) and most often holds **numerical data** (integers, floats).
#   - **Use Case**: Primarily used when you need to perform **numerical operations** on arrays.

# - **Pandas**:
#   - **Supports**: Data organization in tables (DataFrames) and series (single columns).
#   - **Operations**: Allows you to perform a wide range of data operations (filtering, merging, sorting).
#   - **Data Type**: Can hold **any data type** (numbers, strings, dates, etc.).
#   - **Use Case**: Used when you need to perform data operations in Python that are similar to those in Excel.



In [None]:
Here’s a simple explanation of both **NumPy** and **Pandas**:

### NumPy:
- **Definition**: NumPy (Numerical Python) is a library in Python that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
- **Key Features**:
  - Handles **numerical data**.
  - Supports **N-dimensional arrays** (1D, 2D, etc.).
  - Provides efficient operations for mathematical computations.
- **Use Case**: It is primarily used for numerical calculations and data analysis, particularly in scientific computing.

### Pandas:
- **Definition**: Pandas is a library in Python designed for data manipulation and analysis. It provides data structures like Series and DataFrames to store and organize data in a tabular format similar to Excel.
- **Key Features**:
  - Supports various data types (not just numerical).
  - Provides tools for **data manipulation** (filtering, merging, aggregating).
  - Allows easy reading from and writing to different file formats (CSV, Excel, etc.).
- **Use Case**: It is used for data analysis tasks, making it easier to perform operations on structured data, similar to what you would do in Excel.

### Summary:
- **NumPy** is focused on numerical data and mathematical operations, while **Pandas** is focused on data manipulation and analysis, offering a more flexible way to handle various data types in a structured format.
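
A short sketch of the data-type difference summarised above, with made-up values. A NumPy array has a single dtype for all of its elements, so mixing numbers and text forces everything into one common type, whereas a DataFrame keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# One dtype for the whole array: the numbers are converted to strings
mixed_array = np.array([1, 2.5, "three"])
print(mixed_array.dtype)

# One dtype per column: text, integers and floats live side by side
table = pd.DataFrame({
    "state": ["AP", "TS"],
    "population": [1000, 800],
    "density": [100.0, 100.0],
})
print(table.dtypes)
```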

In [13]:
# C. creating a series from np.array
# Example:

import numpy as np
import pandas as pd
arr=np.array([1,2,3,4,5])
res=pd.Series(arr)
print(res)

0    1
1    2
2    3
3    4
4    5
dtype: int32


In [14]:
# D. creating a series from a scalar value (the value is repeated for every index label)
# Example:

series_from_scalar = pd.Series(5, index=[0, 1, 2, 3])
print(series_from_scalar)


0    5
1    5
2    5
3    5
dtype: int64


In [20]:
# We can also check the type of an object with type()

pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
res=pd.Series(pop)
print(res)
print(type(pop))
print(type(res))

AP    1000
TS     800
TN    6700
KA    3400
dtype: int64
<class 'dict'>
<class 'pandas.core.series.Series'>


In [None]:
#DATAFRAME:

# A pandas DataFrame is a 2D, array-like data structure used to store and organise data in tabular form, much like an Excel spreadsheet. It consists of rows and columns, and each value is associated with a row index and a column name.
# A DataFrame can be seen as a collection of Series that share the same index,
# allowing for easy access and manipulation.
# A DataFrame can hold any data type.
# It provides operations like filtering, slicing, and manipulation, similar to the operations in Excel.

In [None]:
# For example, in Excel I want to store and organise two sets of data: population [1000, 800, 6700, 3400] and area [10, 8, 15, 25]. This is easy in Excel: open it, enter the data vertically in 2 columns, and give the columns names so that we know what the data means. We can then perform various operations on it.

# Doing the same in plain Python is harder, so we use pandas. Pandas provides a structured way to handle data similar to how you would in Excel, but within Python. It simplifies working with tabular data and allows you to perform operations programmatically.

# Here's how you can achieve the same thing in Python using Pandas:

# Import Pandas.
# Create a DataFrame with the data.
# Assign column names like "Population" and "Area."
# Perform operations like filtering, sorting, or calculations.

In [34]:
#EXAMPLE: create a DataFrame from dicts with the same keys
# Here the keys of the pop and area dicts are the same

import pandas as pd
pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
area={'AP':10,'TS':8,'TN':15,'KA':25}
res=pd.DataFrame({'population':pop,'area':area})
print(res)

# Column Names: The keys from the outer dictionary you created ('population' and 'area') become the column names of the DataFrame.

# Row Index: The keys from both the pop and area dictionaries ('AP', 'TS', 'TN', and 'KA') serve as the row index. Since both dictionaries share the same keys, Pandas aligns the data correctly based on these keys.

    population  area
AP        1000    10
TS         800     8
TN        6700    15
KA        3400    25


In [33]:
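#EXAMPLE:
# Create a DataFrame from dicts with different keys
# Here the keys of pop and area are all different, so pandas uses the union of the keys as the row index and fills the missing entries with NaN (the integer columns become float because NaN is a float value)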
pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
area={'a':10,'b':8,'c':15,'d':25}
res=pd.DataFrame({'population':pop,'area':area})
print(res)

    population  area
AP      1000.0   NaN
TS       800.0   NaN
TN      6700.0   NaN
KA      3400.0   NaN
a          NaN  10.0
b          NaN   8.0
c          NaN  15.0
d          NaN  25.0
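
A small sketch of the alignment behaviour shown above, using shortened, made-up dicts that share only one key. It illustrates that the row index is the union of the keys, that unmatched cells become NaN, and that `dropna()` keeps only the rows present in both dicts.

```python
import pandas as pd

pop = {"AP": 1000, "TS": 800}
area = {"TS": 8, "KA": 25}

table = pd.DataFrame({"population": pop, "area": area})
print(table)         # three rows: AP, TS and KA
print(table.dtypes)  # both columns are float because they contain NaN

complete = table.dropna()  # keep only rows with no missing value
print(complete)            # only TS appears in both dicts
```

If the NaN rows are unwanted, it is usually simpler to make sure both dicts use the same keys before building the DataFrame.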


In [23]:
#EXAMPLE:

# Create a DataFrame from lists
# Lists have no keys, so pandas assigns the default row index 0 to n-1
import pandas as pd
pop=[1000,800,6700,3400]
area=[10,8,15,25]
res=pd.DataFrame({'population':pop,'area':area})
print(res)

# Column Names: The keys of the dictionary passed to pd.DataFrame ('population' and 'area') become the column names of the DataFrame.

# Row Index: The lists carry no labels, so the rows get the default integer index (0, 1, 2, 3). The lists must all have the same length.

   population  area
0        1000    10
1         800     8
2        6700    15
3        3400    25


In [32]:
#EXAMPLE: when some keys of pop and area are the same
# Here TS and TN appear in both the pop and area dictionaries

pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
area={'a':10,'TS':8,'TN':15,'d':25}
res=pd.DataFrame({'population':pop,'area':area})
print(res)

#NOTE: It is highly recommended to give column names when working with a DataFrame in pandas, but it is not compulsory.

    population  area
AP      1000.0   NaN
TS       800.0   8.0
TN      6700.0  15.0
KA      3400.0   NaN
a          NaN  10.0
d          NaN  25.0


In [35]:
#df.values
# As seen earlier, using .values on a DataFrame returns its data as a 2D NumPy array

#EXAMPLE:
pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
area={'AP':10,'TS':8,'TN':15,'KA':25}
res=pd.DataFrame({'population':pop,'area':area})
print(res)
res2=res.values
print(res2)


    population  area
AP        1000    10
TS         800     8
TN        6700    15
KA        3400    25
[[1000   10]
 [ 800    8]
 [6700   15]
 [3400   25]]


In [36]:
#EXAMPLE: with different keys

pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
area={'a':10,'b':8,'c':15,'d':25}
res=pd.DataFrame({'population':pop,'area':area})
print(res)
res2=res.values
print(res2)

    population  area
AP      1000.0   NaN
TS       800.0   NaN
TN      6700.0   NaN
KA      3400.0   NaN
a          NaN  10.0
b          NaN   8.0
c          NaN  15.0
d          NaN  25.0
[[1000.   nan]
 [ 800.   nan]
 [6700.   nan]
 [3400.   nan]
 [  nan   10.]
 [  nan    8.]
 [  nan   15.]
 [  nan   25.]]


In [40]:
# TO GET THE INDEX & COLUMN NAMES OF A DATAFRAME: find out the row index labels and column names of a DataFrame

# Example:
pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
area={'AP':10,'TS':8,'TN':15,'KA':25}
res=pd.DataFrame({'population':pop,'area':area})
print(res)
# The row index labels of this DataFrame
print(res.index)
# Similarly, the column names of this DataFrame
print(res.columns)

    population  area
AP        1000    10
TS         800     8
TN        6700    15
KA        3400    25
Index(['AP', 'TS', 'TN', 'KA'], dtype='object')
Index(['population', 'area'], dtype='object')


In [41]:
# TO GET A SPECIFIC COLUMN
# Example: get only the area column

pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
area={'AP':10,'TS':8,'TN':15,'KA':25}
res=pd.DataFrame({'population':pop,'area':area})
print(res)
# Now get only the area column
print(res['area'])

# The result is the area column as a Series, with the same index labels

    population  area
AP        1000    10
TS         800     8
TN        6700    15
KA        3400    25
AP    10
TS     8
TN    15
KA    25
Name: area, dtype: int64


In [42]:
# In the examples above, the keys (AP, TS, TN, KA) became the index, but now I want them in a separate column
# Put the states AP, TS, TN, KA in their own 'state' column
pop=[1000,800,6700,3400]
area=[10,8,15,25]
state=['AP','TS','TN','KA']
res=pd.DataFrame({'state':state,'population':pop,'area':area})
print(res)

  state  population  area
0    AP        1000    10
1    TS         800     8
2    TN        6700    15
3    KA        3400    25
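
If the DataFrame has already been built with the states as its index, the index can also be moved into a column without rebuilding the data. A minimal sketch, assuming the column should be called `state` (the name is our choice; `rename_axis` and `reset_index` are standard pandas methods):

```python
import pandas as pd

pop = {"AP": 1000, "TS": 800, "TN": 6700, "KA": 3400}
area = {"AP": 10, "TS": 8, "TN": 15, "KA": 25}
by_state = pd.DataFrame({"population": pop, "area": area})

# Name the index 'state', then turn it into an ordinary column
flat = by_state.rename_axis("state").reset_index()
print(flat)

# set_index goes the other way: a column becomes the index again
restored = flat.set_index("state")
print(restored)
```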


In [47]:
# Ways to create a DataFrame from dictionaries:
# 1. Separate dictionaries
# 2. List of dictionaries
# 3. Single dict with lists

In [48]:
# 1. Separate dictionaries
#  When you use separate dictionaries as in this example, the keys of each dictionary (i.e., 'AP', 'TS', 'TN', and 'KA') become the row labels (index). The column names are specified when building the DataFrame by providing labels like 'population' and 'Area'.

pop={'AP':1000,'TS':800,'TN':6700,'KA':3400}
area={'AP':10,'TS':8,'TN':15,'KA':25}
res=pd.DataFrame({'population':pop,'Area':area})
print(res)

    population  Area
AP        1000    10
TS         800     8
TN        6700    15
KA        3400    25


In [49]:
# 2. List of dictionaries
# When a DataFrame is built from a list of dictionaries, each dictionary represents one row of data, and the keys of the dictionaries become the column names.

data=[{'state':'AP','Pop':1000,'area':10},
      {'state':'TS','Pop':800,'area':8},
      {'state':'TN','Pop':6700,'area':15},
      {'state':'KA','Pop':3400,'area':25}]
res=pd.DataFrame(data)
print(res)

  state   Pop  area
0    AP  1000    10
1    TS   800     8
2    TN  6700    15
3    KA  3400    25


In [50]:
# 3. Single dictionary with list values
# When you use a single dictionary where the values are lists, the keys become the column names, and the values in the lists are stored column-wise.

data={'state':['AP','TS','TN','KA'],
      'pop':[1000,800,6700,3400],
      'area':[10,8,15,25]}
res=pd.DataFrame(data)
print(res)

  state   pop  area
0    AP  1000    10
1    TS   800     8
2    TN  6700    15
3    KA  3400    25


In [52]:
#example:
data=[{'a':i,'b':2*i} for i in range(10)]
res=pd.DataFrame(data)
print(res)
# As seen above, with a list of dicts the keys become the column names and each dict represents one row of data

   a   b
0  0   0
1  1   2
2  2   4
3  3   6
4  4   8
5  5  10
6  6  12
7  7  14
8  8  16
9  9  18


In [54]:
#example:
data=[{'a':10,'b':20},
      {'c':30,'d':40}]
res=pd.DataFrame(data)
print(res)
# Here the dicts share no common keys, so all keys become column names and the missing entries are filled with NaN

      a     b     c     d
0  10.0  20.0   NaN   NaN
1   NaN   NaN  30.0  40.0


In [58]:
#HOMEWORK: create a DataFrame whose 1st column holds the even numbers between 1 and 100 and whose 2nd column holds the odd numbers between 1 and 100.
# Method 1: list of dicts (each dict is one row)
data=[{'even':i+1,'odd':i} for i in range(1,100,2)]
res=pd.DataFrame(data)
print(res)

# Method 2: single dict with lists (each list is one column)
data={'even':list(range(2,101,2)),
      'odd':list(range(1,100,2))}
res=pd.DataFrame(data)
print(res.head())  # first 5 rows only; the full table matches Method 1

    even  odd
0      2    1
1      4    3
2      6    5
3      8    7
4     10    9
5     12   11
6     14   13
7     16   15
8     18   17
9     20   19
10    22   21
11    24   23
12    26   25
13    28   27
14    30   29
15    32   31
16    34   33
17    36   35
18    38   37
19    40   39
20    42   41
21    44   43
22    46   45
23    48   47
24    50   49
25    52   51
26    54   53
27    56   55
28    58   57
29    60   59
30    62   61
31    64   63
32    66   65
33    68   67
34    70   69
35    72   71
36    74   73
37    76   75
38    78   77
39    80   79
40    82   81
41    84   83
42    86   85
43    88   87
44    90   89
45    92   91
46    94   93
47    96   95
48    98   97
49   100   99
   even  odd
0     2    1
1     4    3
2     6    5
3     8    7
4    10    9


In [63]:
#numpy array to pandas dataframe
import numpy as np
arr=np.random.random((3,2))
res=pd.DataFrame(arr)
print(res)
# If you want your own row index and column names
res2=pd.DataFrame(arr,columns=['a','b'],index=['c','d','e'])
print(res2)

          0         1
0  0.402881  0.260388
1  0.467722  0.350959
2  0.109761  0.805377
          a         b
c  0.402881  0.260388
d  0.467722  0.350959
e  0.109761  0.805377


In [3]:
# Create a DataFrame (table) of all ones with 4 rows and 5 columns, with row index [a,b,c,d] and column names [a,b,c,d,e]

import numpy as np
import pandas as pd
data=pd.DataFrame(np.ones((4,5),dtype="int"),index=['a','b','c','d'], columns=['a','b','c','d','e'])
print(data)

   a  b  c  d  e
a  1  1  1  1  1
b  1  1  1  1  1
c  1  1  1  1  1
d  1  1  1  1  1


In [21]:
# Create a 1D array of 5 evenly spaced numbers from 1 to 10 using linspace
data=np.linspace(1,10,5)
print(data)

# Now convert this array to a Series with index [a,b,c,d,e]
res=pd.Series(data,index=['a','b','c','d','e'])
print(res)

# Now extract the value at index b
res1=res['b']
print(res1)

# Now change the value at index b to 12
res['b']=12
print(res)

# Now print the first 3 values
res1=res['a':'c']
print(res1)
# another method
res1=res[0:3]
print(res1)

# Now filter the values greater than 6 from the Series
res1=res[res>6]
print(res1)

# Similarly, filter the values greater than 6 and less than 10
res1=res[(res>6) & (res<10)]  # use & here, not the Python keyword 'and'
print(res1)
(res>6) & (res<10)

[ 1.    3.25  5.5   7.75 10.  ]
a     1.00
b     3.25
c     5.50
d     7.75
e    10.00
dtype: float64
3.25
a     1.00
b    12.00
c     5.50
d     7.75
e    10.00
dtype: float64
a     1.0
b    12.0
c     5.5
dtype: float64
a     1.0
b    12.0
c     5.5
dtype: float64
b    12.00
d     7.75
e    10.00
dtype: float64
d    7.75
dtype: float64


a    False
b    False
c    False
d     True
e    False
dtype: bool
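
The comment above says to use `&` rather than `and`. The sketch below, which rebuilds the same five values by hand, shows the three element-wise operators (`&`, `|`, `~`) and what happens when the Python keyword `and` is used instead: pandas raises a `ValueError` because a whole Series has no single True/False value.

```python
import pandas as pd

res = pd.Series([1.0, 12.0, 5.5, 7.75, 10.0], index=list("abcde"))

both = res[(res > 6) & (res < 10)]    # AND: greater than 6 and less than 10
either = res[(res < 2) | (res > 10)]  # OR: less than 2 or greater than 10
negated = res[~(res > 6)]             # NOT: everything that is not greater than 6

and_failed = False
try:
    res[(res > 6) and (res < 10)]     # Python 'and' needs a single True/False
except ValueError as err:
    and_failed = True
    print("Using 'and' fails:", err)

print(both)
print(either)
print(negated)
```

The parentheses around each comparison are required, because `&` and `|` bind more tightly than `>` and `<`.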

In [26]:
# EXAMPLE:
area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
pop={'california':3833251,'Texas':26448193,'Newyork':19651127,'Florida':19552860,'Illinois':12882135}
# Now create a DataFrame with the column names 'area' and 'Population'
res=pd.DataFrame({'area':area,'Population':pop})
print(res)

# Now, from this DataFrame, get only the area column
print(res['area'])

              area  Population
california  423967     3833251
Texas       695662    26448193
Newyork     141297    19651127
Florida     170312    19552860
Illinois    149995    12882135
california    423967
Texas         695662
Newyork       141297
Florida       170312
Illinois      149995
Name: area, dtype: int64


In [2]:
# .LOC[]:
# The .loc[] accessor in pandas is used for label-based indexing, allowing you to select data from a Series or DataFrame by index labels (for rows) and column names (for DataFrames)

#---------------------------------------------------------------------------------------------------------------------------------------------------------

#.LOC[] IN SERIES (1 DIMENSION):
# A Series is like a single column or list of values with an associated index. The .loc[] accessor on a Series selects elements based on index labels.
import pandas as pd
area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
res=pd.Series(area)
print(res)

# Now access the value of california
print(res.loc['california'])

# Now access the values of california and texas
print(res.loc[['california','Texas']])

# Now access the values from texas to illinois (slicing)
print(res.loc['Texas':'Illinois'])

# Key Points:
# .loc[] allows you to select data by the index labels.
# You can select a single value, a list of values, or a range of values.
# Label slicing with .loc[] in a Series includes the endpoint.
# You can use conditions inside .loc[] for filtering data.

# ---------------------------------------------------------------------------------------------------------------------------------------------------

#.LOC[] IN DATAFRAME (2 DIMENSIONS)
# In a DataFrame, .loc[] is used to access row and column data based on labels. Since a DataFrame has both row index labels and column names, .loc[] allows you to access both.
area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
pop={'california':3833251,'Texas':26448193,'Newyork':19651127,'Florida':19552860,'Illinois':12882135}
res=pd.DataFrame({'area':area,'population':pop})
print(res)

# Now print the complete row of california
print(res.loc['california'])

# Now print the complete rows of california and texas
print(res.loc[['california','Texas']])

# Now print all rows from texas to illinois
print(res.loc['Texas':'Illinois'])

# Now print only the area column
print(res['area'])

# Now print the area value of california
print(res.loc['california','area'])

# Now print the area values of california and texas
print(res.loc[['california','Texas'],'area'])

# Now print the area values from texas to illinois
print(res.loc['Texas':'Illinois','area'])

# KEY POINTS:
# .loc[] is used to select data by row labels and column names.
# You can select a single value, multiple values, or a range of values.
# You can use conditional filtering with .loc[] in DataFrames.
# When slicing rows, the end value is included.

california    423967
Texas         695662
Newyork       141297
Florida       170312
Illinois      149995
dtype: int64
423967
california    423967
Texas         695662
dtype: int64
Texas       695662
Newyork     141297
Florida     170312
Illinois    149995
dtype: int64
              area  population
california  423967     3833251
Texas       695662    26448193
Newyork     141297    19651127
Florida     170312    19552860
Illinois    149995    12882135
area           423967
population    3833251
Name: california, dtype: int64
              area  population
california  423967     3833251
Texas       695662    26448193
            area  population
Texas     695662    26448193
Newyork   141297    19651127
Florida   170312    19552860
Illinois  149995    12882135
california    423967
Texas         695662
Newyork       141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
423967
california    423967
Texas         695662
Name: area, dtype: int64
Texas       695662
Newyork  
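
The key points mention conditional filtering with `.loc[]` without showing it. A short sketch, reusing the same area and population figures; the 150000 threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

area = {"california": 423967, "Texas": 695662, "Newyork": 141297,
        "Florida": 170312, "Illinois": 149995}
pop = {"california": 3833251, "Texas": 26448193, "Newyork": 19651127,
       "Florida": 19552860, "Illinois": 12882135}
res = pd.DataFrame({"area": area, "population": pop})

# A boolean condition in the row position keeps only the matching rows
big = res.loc[res["area"] > 150000]
print(big)

# Condition for the rows plus a column label: population of the large states
big_pop = res.loc[res["area"] > 150000, "population"]
print(big_pop)
```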

In [12]:
#ILOC[]:
# It is used to select rows and columns in a DataFrame or Series by their integer positions.

#NOTE:
#.loc[] is label-based indexing, where you use row and column labels (names) to select data.
# .iloc[] is integer-based indexing, where you use integer positions (numeric indices) to select data.


area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
res=pd.Series(area)
print(res)

# Now access the value of california (position 0)
print(res.iloc[0])

# Now access the values of california and texas (positions 0 and 1)
print(res.iloc[[0,1]])

# Now access the values from texas to illinois (slicing; stop position 5 is excluded)
print(res.iloc[1:5])


# dataframe:

area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
pop={'california':3833251,'Texas':26448193,'Newyork':19651127,'Florida':19552860,'Illinois':12882135}
res=pd.DataFrame({'area':area,'population':pop})
print(res)

# Now print the complete row of california (position 0)
print(res.iloc[0])

# Now print the complete rows of california and texas
print(res.iloc[[0,1]])

# Now print all rows from texas to illinois
print(res.iloc[1:5])

# Now print only the area column
print(res['area'])

# Now print the area value of california (row 0, column 0; 'area' is the first column)
print(res.iloc[0,0])

# Now print the area values of california and texas
print(res.iloc[[0,1],0])

# Now print the area values from texas to illinois
print(res.iloc[1:5,0])

# Key Features of .iloc[]:
# Positional Indexing: It uses numerical indices to select data, unlike .loc[], which uses labels.
# Row and Column Selection: You can specify the row and column indices to access specific data.
# Slicing: It supports slicing, allowing you to select ranges of rows or columns. Unlike label slices with .loc[], position slices exclude the stop position.


california    423967
Texas         695662
Newyork       141297
Florida       170312
Illinois      149995
dtype: int64
423967
california    423967
Texas         695662
dtype: int64
Texas       695662
Newyork     141297
Florida     170312
Illinois    149995
dtype: int64
              area  population
california  423967     3833251
Texas       695662    26448193
Newyork     141297    19651127
Florida     170312    19552860
Illinois    149995    12882135
area           423967
population    3833251
Name: california, dtype: int64
              area  population
california  423967     3833251
Texas       695662    26448193
            area  population
Texas     695662    26448193
Newyork   141297    19651127
Florida   170312    19552860
Illinois  149995    12882135
california    423967
Texas         695662
Newyork       141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
423967
california    423967
Texas         695662
Name: area, dtype: int64
Texas       695662
Newyork     141297
Florida     170312
Illinois    149995
Name: area, dtype: int64

In [13]:
#EXAMPLE

area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
pop={'california':3833251,'Texas':26448193,'Newyork':19651127,'Florida':19552860,'Illinois':12882135}
res=pd.DataFrame({'area':area,'population':pop})
print(res)
# From this data, create a new column named Density (population/area)
res['Density']=res['population']/res['area']
print(res)

              area  population
california  423967     3833251
Texas       695662    26448193
Newyork     141297    19651127
Florida     170312    19552860
Illinois    149995    12882135
              area  population     Density
california  423967     3833251    9.041390
Texas       695662    26448193   38.018740
Newyork     141297    19651127  139.076746
Florida     170312    19552860  114.806121
Illinois    149995    12882135   85.883763


In [14]:
res1=res.T
print(res1)
# .T transposes the DataFrame: all columns become rows and all rows become columns

              california         Texas       Newyork       Florida  \
area        4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05   
population  3.833251e+06  2.644819e+07  1.965113e+07  1.955286e+07   
Density     9.041390e+00  3.801874e+01  1.390767e+02  1.148061e+02   

                Illinois  
area        1.499950e+05  
population  1.288214e+07  
Density     8.588376e+01  


In [21]:
area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
pop={'california':3833251,'Texas':26448193,'Newyork':19651127,'Florida':19552860,'Illinois':12882135}
res=pd.DataFrame({'area':area,'population':pop})
print(res)
# From this DataFrame, get only the rows from california to newyork along with their area and population values
res1=res.iloc[0:3,0:2]
print(res1)


              area  population
california  423967     3833251
Texas       695662    26448193
Newyork     141297    19651127
Florida     170312    19552860
Illinois    149995    12882135
              area  population
california  423967     3833251
Texas       695662    26448193
Newyork     141297    19651127


In [27]:
#EXAMPLE:
data={'mathematics':[10,20,30,40,50],
      'social marks':[20,40,50,80,90],
      'science marks':[70,30,70,10,90]}
res=pd.DataFrame(data,index=['Ram','Rahul','Rohit','Kohili','jedeja'])
print(res)

# Now find the marks of jedeja
print(res.loc['jedeja'])

# Now filter all rows where science marks are greater than 20
res1=res[res['science marks']>20]
print(res1)

# Now filter all rows where science marks are greater than 10 and social marks are greater than 40
res1=res[(res['science marks']>10) & (res['social marks']>40)]
print(res1)

        mathematics  social marks  science marks
Ram              10            20             70
Rahul            20            40             30
Rohit            30            50             70
Kohili           40            80             10
jedeja           50            90             90
mathematics      50
social marks     90
science marks    90
Name: jedeja, dtype: int64
        mathematics  social marks  science marks
Ram              10            20             70
Rahul            20            40             30
Rohit            30            50             70
jedeja           50            90             90
        mathematics  social marks  science marks
Rohit            30            50             70
jedeja           50            90             90


In [31]:
# OPERATIONS:

import numpy as np
np.random.seed(1001)  # seed() returns None, so there is nothing to assign
arr1=np.random.randint(1,10,4)
print(arr1)
# np.random generates different numbers on every run, which makes results hard to reproduce while working on a task. Setting a seed with np.random.seed (here 1001) fixes the sequence of random numbers, so every run of this code produces the same values.

[4 9 4 6]
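
A short sketch that checks the reproducibility claim directly: re-seeding with the same number repeats the same draw. The second half uses `np.random.default_rng`, NumPy's newer Generator interface, as an alternative to the global seed; it is not used elsewhere in this notebook and is shown only for comparison.

```python
import numpy as np

# Legacy global seed, as used above: same seed, same numbers
np.random.seed(1001)
first = np.random.randint(1, 10, 4)
np.random.seed(1001)
second = np.random.randint(1, 10, 4)
print(np.array_equal(first, second))

# Generator interface: each generator carries its own seed and state
rng_a = np.random.default_rng(1001)
rng_b = np.random.default_rng(1001)
draw_a = rng_a.integers(1, 10, size=4)  # upper bound 10 is excluded
draw_b = rng_b.integers(1, 10, size=4)
print(np.array_equal(draw_a, draw_b))
```

The two interfaces use different algorithms, so the same seed does not give the same numbers across them; only repeated runs of the same interface match.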


In [36]:
#EXAMPLE:

data=pd.DataFrame(np.random.randint(0,10,(3,4)),columns=['A','B','C','D'])
print(data)

# Now compute the sine of every value (NumPy functions apply element-wise and keep the index and column labels)
res=np.sin(data)
print(res)

# To compute the sine for one particular column only
res=np.sin(data['A'])
print(res)

   A  B  C  D
0  5  0  6  7
1  7  5  3  0
2  2  2  3  2
          A         B         C         D
0 -0.958924  0.000000 -0.279415  0.656987
1  0.656987 -0.958924  0.141120  0.000000
2  0.909297  0.909297  0.141120  0.909297
0   -0.958924
1    0.656987
2    0.909297
Name: A, dtype: float64
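

In [None]:
# A short sketch showing that NumPy functions keep the row and column labels of a DataFrame, and that arithmetic between columns works element by element. The values are the ones printed above, typed in by hand so the result does not depend on the random seed; squared and total are hypothetical names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[5, 0, 6, 7],
                     [7, 5, 3, 0],
                     [2, 2, 3, 2]], columns=['A', 'B', 'C', 'D'])

# A NumPy ufunc returns a DataFrame with the same labels
squared = np.square(data)
print(squared)

# Column arithmetic is applied row by row
total = data['A'] + data['B']
print(total)
```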


In [47]:
# HANDLING MISSING VALUES

# EXAMPLE: a DataFrame that contains missing values
array=np.array([[1,np.nan,2],[2,3,5],[np.nan,4,6]])
data=pd.DataFrame(array)
print(data)

# The problem with this data is that it contains some NaN (Not a Number) values

# Count the NaN values in each column
res=data.isnull().sum()
print(res)

# Drop NaN values
res=data.dropna()
print(res)
# By default axis=0, so every row that contains at least one NaN is dropped. Here rows 0 and 2 each contain a NaN, so only row 1 remains.

# axis=1
res=data.dropna(axis=1)
print(res)
# With axis=1, every column that contains at least one NaN is dropped. Here columns 0 and 1 contain NaN, so both are dropped and only column 2 remains.


# We can also fill NaN values
res=data.ffill()  # forward fill
print(res)
# Forward fill replaces each NaN with the previous value in the same column. There are two NaN values: the last value of column 0 and the first value of column 1. The first has a previous value (2.0) and is filled; the second has no previous value, so it stays NaN.

# Similarly
res=data.bfill()  # backward fill
print(res)
# Backward fill replaces each NaN with the next value in the same column. The NaN at the end of column 0 has no next value, so it stays NaN; the NaN at the start of column 1 is filled with the next value (3.0).

# We can also fill NaN values with a mean
res=data.fillna(np.mean(data))
print(res)
# As the output shows, np.mean(data) here returns one overall mean of all non-missing values (23/7 = 3.285714), so both NaN values are filled with that same number. To fill each NaN with the mean of its own column, pass data.mean() instead.

     0    1    2
0  1.0  NaN  2.0
1  2.0  3.0  5.0
2  NaN  4.0  6.0
0    1
1    1
2    0
dtype: int64
     0    1    2
1  2.0  3.0  5.0
     2
0  2.0
1  5.0
2  6.0
     0    1    2
0  1.0  NaN  2.0
1  2.0  3.0  5.0
2  2.0  4.0  6.0
     0    1    2
0  1.0  3.0  2.0
1  2.0  3.0  5.0
2  NaN  4.0  6.0
          0         1    2
0  1.000000  3.285714  2.0
1  2.000000  3.000000  5.0
2  3.285714  4.000000  6.0
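

In [None]:
# A minimal sketch of filling each NaN with the mean of its own column, using data.mean(), which returns one mean per column. The name filled is hypothetical; the data is the same 3x3 array as above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [2, 3, 5],
                              [np.nan, 4, 6]]))

# data.mean() skips NaN and gives one value per column
print(data.mean())

# fillna with a Series fills each column with its matching value
filled = data.fillna(data.mean())
print(filled)
```

# Column 0 has the values 1 and 2, so its NaN becomes 1.5; column 1 has 3 and 4, so its NaN becomes 3.5.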


In [None]:
# HIERARCHICAL INDEXING:


In [12]:
# EXAMPLE:

# MULTIINDEX:

import pandas as pd

# First, store every combination of index values in a list of tuples
index=[('california',2000),('california',2001),('california',2010),
       ('new york',2000),('new york',2010),
       ('texas',2000),('texas',2010)]# list of tuples
population=[33871648,37253956,18976457,19378102,20851820,25145561,26448193]
# A Series is enough here because there is only one column of values (population)
res=pd.Series(population,index=index)
print(res)

# (california, 2000)    33871648
# (california, 2001)    37253956
# (california, 2010)    18976457
# (new york, 2000)      19378102
# (new york, 2010)      20851820
# (texas, 2000)         25145561
# (texas, 2010)         26448193
# The index labels above are plain tuples in parentheses, which is not how a table with two index columns looks in Excel.
# A MultiIndex fixes this.

index=[('california',2000),('california',2001),('california',2010),
       ('new york',2000),('new york',2010),
       ('texas',2000),('texas',2010)]
population=[33871648,37253956,18976457,19378102,20851820,25145561,26448193]
res=pd.MultiIndex.from_tuples(index)
res1=pd.Series(population,index=res)
print(res1)

# Use a MultiIndex when each index level should be stored as its own column. Without it, every label is a single tuple held in one column, which makes selecting by one level awkward.

# With a MultiIndex, the data is laid out the way it would be in Excel.

# Convert the second index level (2000,2001,2010,2000,2010,2000,2010) into columns
res=res1.unstack()
print(res)

# Convert it back to the original layout
res=res.stack()
print(res)

# unstack() moves an index level into the columns
# stack() moves the columns back into an index level
# Combinations that do not exist (for example new york, 2001) become NaN after unstack(), which is why the values come back as float64


(california, 2000)    33871648
(california, 2001)    37253956
(california, 2010)    18976457
(new york, 2000)      19378102
(new york, 2010)      20851820
(texas, 2000)         25145561
(texas, 2010)         26448193
dtype: int64
california  2000    33871648
            2001    37253956
            2010    18976457
new york    2000    19378102
            2010    20851820
texas       2000    25145561
            2010    26448193
dtype: int64
                  2000        2001        2010
california  33871648.0  37253956.0  18976457.0
new york    19378102.0         NaN  20851820.0
texas       25145561.0         NaN  26448193.0
california  2000    33871648.0
            2001    37253956.0
            2010    18976457.0
new york    2000    19378102.0
            2010    20851820.0
texas       2000    25145561.0
            2010    26448193.0
dtype: float64
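

In [None]:
# A short sketch of selecting data from the MultiIndex Series built above. It assumes the two levels are given the hypothetical names 'state' and 'year'; pop, cal, one and y2010 are also hypothetical names.

```python
import pandas as pd

index = [('california', 2000), ('california', 2001), ('california', 2010),
         ('new york', 2000), ('new york', 2010),
         ('texas', 2000), ('texas', 2010)]
population = [33871648, 37253956, 18976457, 19378102,
              20851820, 25145561, 26448193]
pop = pd.Series(population,
                index=pd.MultiIndex.from_tuples(index, names=['state', 'year']))

# Outer level only: all years for one state
cal = pop['california']
print(cal)

# Both levels: a single value
one = pop.loc[('california', 2000)]
print(one)

# Inner level only: one year across all states
y2010 = pop.xs(2010, level='year')
print(y2010)
```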


In [17]:
# The example above used a Series because there was only one column of values (population)

# Now there are two columns, population and under_18, so we use a DataFrame
population=[33871648,37253956,18976457,19378102,20851820,25145561,26448193]
under_18=[1234,23345,3456,4567,5678,6789,78910]
index=[('california',2000),('california',2001),('california',2010),
       ('new york',2000),('new york',2010),
       ('texas',2000),('texas',2010)]
res=pd.DataFrame({'population':population,'under_18':under_18},index=pd.MultiIndex.from_tuples(index))
print(res)

                 population  under_18
california 2000    33871648      1234
           2001    37253956     23345
           2010    18976457      3456
new york   2000    19378102      4567
           2010    20851820      5678
texas      2000    25145561      6789
           2010    26448193     78910


In [32]:
# Another example: with 3 index levels, writing out every combination such as ('a', 1, 'c'),('a', 1, 'd'),('a', 2, 'c'),('a', 2, 'd'),('b', 1, 'c'),('b', 1, 'd'),('b', 2, 'c'),('b', 2, 'd') by hand is tedious
# For this case we use pd.MultiIndex.from_product()
index=pd.MultiIndex.from_product([['a','b'],[1,2],['c','d']])  # each list supplies the values of one index level
print(index)

# or 

index=[('a', 1, 'c'),
            ('a', 1, 'd'),
            ('a', 2, 'c'),
            ('a', 2, 'd'),
            ('b', 1, 'c'),
            ('b', 1, 'd'),
            ('b', 2, 'c'),
            ('b', 2, 'd')]
res=pd.MultiIndex.from_tuples(index)
print(res)


MultiIndex([('a', 1, 'c'),
            ('a', 1, 'd'),
            ('a', 2, 'c'),
            ('a', 2, 'd'),
            ('b', 1, 'c'),
            ('b', 1, 'd'),
            ('b', 2, 'c'),
            ('b', 2, 'd')],
           )
MultiIndex([('a', 1, 'c'),
            ('a', 1, 'd'),
            ('a', 2, 'c'),
            ('a', 2, 'd'),
            ('b', 1, 'c'),
            ('b', 1, 'd'),
            ('b', 2, 'c'),
            ('b', 2, 'd')],
           )
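

In [None]:
# A quick check, as a sketch, that from_product builds the same combinations as the Cartesian product of the input lists. It uses itertools.product from the standard library for comparison; levels, idx and combos are hypothetical names.

```python
import itertools
import pandas as pd

levels = [['a', 'b'], [1, 2], ['c', 'd']]

# One index level per input list
idx = pd.MultiIndex.from_product(levels)

# The same combinations built by hand
combos = list(itertools.product(*levels))

print(len(idx))             # 2 * 2 * 2 combinations
print(list(idx) == combos)  # same tuples in the same order
```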


In [30]:
# Another example
# using MultiIndex.from_tuples()
index=[(2013,1),(2013,2),(2014,1),(2014,2)]
res=pd.MultiIndex.from_tuples(index)
print(res)

# OR
# using MultiIndex.from_product()
index=[[2013,2014],[1,2]]
res=pd.MultiIndex.from_product(index)
print(res)

MultiIndex([(2013, 1),
            (2013, 2),
            (2014, 1),
            (2014, 2)],
           )
MultiIndex([(2013, 1),
            (2013, 2),
            (2014, 1),
            (2014, 2)],
           )


In [3]:
# EXAMPLE:
import numpy as np
import pandas as pd

# This example has two MultiIndexes: A) a row MultiIndex and B) a column MultiIndex


# row index
row_index=pd.MultiIndex.from_product([[2013,2014],[1,2]],names=['year','visit'])

# column index
column_index=pd.MultiIndex.from_product([['Bob','Ram','Ravi'],['HR','Temp']],names=['subject','type'])
data=np.random.randn(4,6)  # a 4x6 array drawn from the standard normal distribution (mean 0, std 1)
res=pd.DataFrame(data,index=row_index,columns=column_index)
print(res)



subject          Bob                 Ram                Ravi          
type              HR      Temp        HR      Temp        HR      Temp
year visit                                                            
2013 1      1.683373  1.232347  1.813016 -1.421694 -1.486600 -0.837607
     2      0.371660  1.168389 -1.205759  0.929774  0.252888  1.166112
2014 1      0.922962  0.883070  0.957626  1.997315 -0.794220  0.746227
     2     -0.484017  0.465319  0.373526  1.730725 -1.348634  0.812008
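

In [None]:
# A minimal sketch of selecting from a DataFrame that has a MultiIndex on both rows and columns. It swaps the random values for np.arange(24) so the selected numbers are predictable; that substitution is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

row_index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                       names=['year', 'visit'])
column_index = pd.MultiIndex.from_product([['Bob', 'Ram', 'Ravi'], ['HR', 'Temp']],
                                          names=['subject', 'type'])
# Predictable values 0..23 instead of random numbers
res = pd.DataFrame(np.arange(24).reshape(4, 6),
                   index=row_index, columns=column_index)

# One column: outer and inner column labels together
bob_hr = res['Bob', 'HR']
print(bob_hr)

# All columns of one subject
ram = res['Ram']
print(ram)

# All visits of one year
y2013 = res.loc[2013]
print(y2013)

# A single cell: (row labels), (column labels)
cell = res.loc[(2014, 2), ('Ravi', 'Temp')]
print(cell)
```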


In [None]:
# NOW LET US SEE
# 1. JOIN, CONCATENATE, MERGE
# 2. GROUPBY
# 3. PIVOT TABLES

In [8]:
# 1. CONCATENATION:
# In pandas, concatenation is a way to combine two or more datasets (like tables) into a single one. Imagine you have two tables and you want to either stack them on top of each other or place them side by side.

# Creating two DataFrames
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenating vertically
result = pd.concat([df1, df2], axis=0)
print(result)

# Concatenating horizontally
result = pd.concat([df1, df2], axis=1)
print(result)

# EXAMPLE: STACKING VERTICALLY (axis=0)
# Consider two DataFrames A and B that we want to concatenate
A=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['A','B'])
B=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['A','B'])

# Stack A on top of B (axis=0, the default)
# NOTE: for stacking (axis=0), column labels should ideally be the same. Row indices can be the same or different.
res=pd.concat([A,B])
print(res)

# EXAMPLE: stacking works best when the column names match; the row indices may repeat
# Now see what happens when the column names are different
A=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['A','B'])
B=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['C','D'])
# A and B have completely different column labels, so the cells that have no data are filled with NaN
res=pd.concat([A,B])
print(res)

# EXAMPLE: some column names are the same
A=pd.DataFrame(np.random.randn(2,3),index=[1,2],columns=['A','B','C'])
B=pd.DataFrame(np.random.randn(2,3),index=[1,2],columns=['D','C','E'])
# A and B share one column label, C
res=pd.concat([A,B])
print(res)

# EXAMPLE:
import numpy as np
import pandas as pd 
A=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['A','B'])
B=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['A','B'])
# A and B have the same row indices, so after stacking the result has repeated index labels (1,2,1,2). ignore_index=True replaces them with a fresh 0..n-1 index
res=pd.concat([A,B],ignore_index=True)
print(res)
# Now the output has sequential row indices


# EXAMPLE: SIDE-BY-SIDE CONCATENATION (axis=1)
# For side-by-side concatenation (axis=1), row indices and column labels can each be the same or different. Rows are aligned on the index.

#EXAMPLE: both row indexes and column labels are same
A=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['A','B'])
B=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['A','B'])
res=pd.concat([A,B],axis=1)
print(res)

# EXAMPLE: both row indices and column labels are different
A=pd.DataFrame(np.random.randn(2,2),index=[1,2],columns=['A','B'])
B=pd.DataFrame(np.random.randn(2,2),index=[3,4],columns=['C','D'])
res=pd.concat([A,B],axis=1)
print(res)


# EXAMPLE: some column labels are same and all row index are same
A=pd.DataFrame(np.random.randn(2,3),index=[1,2],columns=['A','B','C'])
B=pd.DataFrame(np.random.randn(2,3),index=[1,2],columns=['C','D','E'])
res=pd.concat([A,B],axis=1)
print(res)

# NOTE:
# ignore_index=True discards the labels along the concatenation axis and replaces them with 0..n-1.
# With axis=0 it renumbers the rows; with axis=1 it renumbers the columns.



# NOTE:
# Side by side (axis=1): Row indices don't have to be the same. Column labels don't have to match.
# Stacking (axis=0): Column labels should ideally be the same. Row indices can be the same or different.

   A  B
0  1  3
1  2  4
0  5  7
1  6  8
   A  B  A  B
0  1  3  5  7
1  2  4  6  8
          A         B
1  0.795131  1.234955
2 -0.622598  0.771545
1 -2.199720 -1.500374
2  0.067442  1.477343
          A         B         C         D
1  0.110890 -0.044023       NaN       NaN
2 -0.468709 -2.099122       NaN       NaN
1       NaN       NaN -1.280889 -0.413677
2       NaN       NaN -0.437859  0.258459
          A         B         C         D         E
1 -3.231535 -0.792400  0.420012       NaN       NaN
2  1.165420 -1.014383  0.305027       NaN       NaN
1       NaN       NaN -2.372138  2.715011  1.628323
2       NaN       NaN  0.502544  1.239243  0.879523
          A         B
0  0.619072  1.283240
1 -0.929644 -0.354599
2  1.358623  1.061535
3  0.551677  1.770384
          A         B         A         B
1 -0.176868 -0.951819  0.077641  0.481663
2  0.039454  0.562504  0.162262  0.479474
          A         B         C         D
1 -0.098575  0.190921       NaN       NaN
2  1.316637  0.495
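

In [None]:
# A short sketch of three pd.concat options that help when row labels repeat: keys= records which input each row came from, ignore_index=True renumbers the concatenation axis, and verify_integrity=True raises an error on duplicate labels. The names labelled, side and the key strings 'first' and 'second' are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# keys= adds an outer index level naming the source of each row
labelled = pd.concat([df1, df2], keys=['first', 'second'])
print(labelled)
print(labelled.loc['second'])

# With axis=1, ignore_index=True renumbers the columns
side = pd.concat([df1, df2], axis=1, ignore_index=True)
print(side)

# verify_integrity=True refuses to build a result with repeated row labels
try:
    pd.concat([df1, df2], verify_integrity=True)
    duplicate_error = False
except ValueError:
    duplicate_error = True
print(duplicate_error)
```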

In [58]:
# EXAMPLE: two DataFrames that share some columns
A=pd.DataFrame(np.random.randn(2,3),index=[1,2],columns=['A','B','C'])
B=pd.DataFrame(np.random.randn(2,3),index=[1,2],columns=['B','C','D'])
# A and B have two columns in common, ['B','C']. To keep only those shared columns, use join='inner'
res=pd.concat([A,B],join='inner')  # an inner join keeps only the columns present in both DataFrames
print(res)

          B         C
1  0.690245  0.176697
2  0.767757 -0.864199
1 -0.850026 -0.206755
2  1.214751  0.328345
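

In [None]:
# A small side-by-side comparison, as a sketch, of join='inner' and the default join='outer' when stacking. Fixed integers replace the random values so the result is predictable; inner and outer are hypothetical names.

```python
import pandas as pd

A = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=[1, 2], columns=['A', 'B', 'C'])
B = pd.DataFrame([[7, 8, 9], [10, 11, 12]], index=[1, 2], columns=['B', 'C', 'D'])

# inner: only the columns found in both inputs
inner = pd.concat([A, B], join='inner')
print(inner)

# outer (default): every column, with NaN where an input has no data
outer = pd.concat([A, B])
print(outer)
```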


In [9]:
# MERGING AND JOINING DATAFRAMES
# MERGING: merging in pandas combines two or more DataFrames based on common columns or indices

#EXAMPLE:
A=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun'],'branch':['ECE','EEE','EEE','CSE']})
B=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun'],'marks':[40,50,60,80]})
res=pd.merge(A,B)
print(res)

# Both A and B have a student column, so pd.merge uses it as the key to combine the two DataFrames

# How it works:
# Common column: This column is used to match rows from both DataFrames. It will appear only once in the merged result.
# Uncommon columns: These are added from both DataFrames and are included in the result.

# The student column is the common column between A and B, and it appears only once in the result.
# The branch column is from A, and it gets added to the result.
# The marks column is from B, and it also gets added to the result.


  student branch  marks
0     Ram    ECE     40
1   Vinay    EEE     50
2  Dheera    EEE     60
3   Varun    CSE     80


In [72]:
#EXAMPLE:
A=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun'],'branch':['ECE','EEE','EEE','CSE']})
B=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun'],'marks':[40,50,60,80]})
res=pd.merge(A,B)
print(res)
C=pd.DataFrame({'branch':['ECE','EEE']})
res1=pd.merge(res,C)
print(res1)

  student branch  marks
0     Ram    ECE     40
1   Vinay    EEE     50
2  Dheera    EEE     60
3   Varun    CSE     80
  student branch  marks
0     Ram    ECE     40
1   Vinay    EEE     50
2  Dheera    EEE     60


In [3]:
# EXAMPLE:
A=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun'],'branch':['ECE','EEE','EEE','CSE']})

B=pd.DataFrame({'branch':['ECE','ECE','EEE','EEE','CSE','CSE'],'skills':['excel','spreadsheet','coding','linux','spreadsheet','organisation']})
res=pd.merge(A,B)
print(res)

  student branch        skills
0     Ram    ECE         excel
1     Ram    ECE   spreadsheet
2   Vinay    EEE        coding
3   Vinay    EEE         linux
4  Dheera    EEE        coding
5  Dheera    EEE         linux
6   Varun    CSE   spreadsheet
7   Varun    CSE  organisation


In [70]:
# EXAMPLE:

A=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun'],'branch':['ECE','EEE','EEE','CSE']})
B=pd.DataFrame({'name':['Vinay','Ram','Dheera','Varun'],'salary':[10000,20000,30000,40000]})
res = pd.merge(A, B, left_on='student', right_on='name')
print(res)



  student branch    name  salary
0     Ram    ECE     Ram   20000
1   Vinay    EEE   Vinay   10000
2  Dheera    EEE  Dheera   30000
3   Varun    CSE   Varun   40000
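

In [None]:
# When the key columns have different names, the merged result keeps both of them, as the output above shows. A minimal sketch of dropping the redundant column afterwards; the name tidy is hypothetical.

```python
import pandas as pd

A = pd.DataFrame({'student': ['Ram', 'Vinay', 'Dheera', 'Varun'],
                  'branch': ['ECE', 'EEE', 'EEE', 'CSE']})
B = pd.DataFrame({'name': ['Vinay', 'Ram', 'Dheera', 'Varun'],
                  'salary': [10000, 20000, 30000, 40000]})

# Match A.student against B.name, then remove the duplicate key column
res = pd.merge(A, B, left_on='student', right_on='name')
tidy = res.drop(columns='name')
print(tidy)
```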


In [8]:
# LAST THING ABOUT MERGING:
# 1.HOW=INNER
# 2.HOW=OUTER

#EXAMPLE: HOW=INNER
A=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun'],'branch':['ECE','EEE','EEE','CSE']})
B=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun','rana','nani'],'marks':[40,50,60,80,90,100]})
# B's student column contains two names, rana and nani, that are not in A. With how='inner' (the default), only the names present in both DataFrames are merged and the rest are dropped.
res=pd.merge(A,B)
print(res)

#EXAMPLE: HOW=OUTER
# Now keep all names, whether or not they appear in both DataFrames
A=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun'],'branch':['ECE','EEE','EEE','CSE']})
B=pd.DataFrame({'student':['Ram','Vinay','Dheera','Varun','rana','nani'],'marks':[40,50,60,80,90,100]})
res=pd.merge(A,B,how='outer')
print(res)

  student branch  marks
0     Ram    ECE     40
1   Vinay    EEE     50
2  Dheera    EEE     60
3   Varun    CSE     80
  student branch  marks
0  Dheera    EEE     60
1     Ram    ECE     40
2   Varun    CSE     80
3   Vinay    EEE     50
4    nani    NaN    100
5    rana    NaN     90
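

In [None]:
# Besides 'inner' and 'outer', pd.merge also accepts how='left' and how='right'. A short sketch of those two, plus indicator=True, which adds a _merge column saying where each row came from. The names left, right, outer and unmatched are hypothetical.

```python
import pandas as pd

A = pd.DataFrame({'student': ['Ram', 'Vinay', 'Dheera', 'Varun'],
                  'branch': ['ECE', 'EEE', 'EEE', 'CSE']})
B = pd.DataFrame({'student': ['Ram', 'Vinay', 'Dheera', 'Varun', 'rana', 'nani'],
                  'marks': [40, 50, 60, 80, 90, 100]})

# left: keep every row of A; right: keep every row of B
left = pd.merge(A, B, how='left')
right = pd.merge(A, B, how='right')
print(len(left), len(right))

# indicator=True marks rows as 'both', 'left_only' or 'right_only'
outer = pd.merge(A, B, how='outer', indicator=True)
print(outer)

# Names that exist only in B
unmatched = sorted(outer.loc[outer['_merge'] == 'right_only', 'student'])
print(unmatched)
```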


In [6]:
import seaborn as sns 
titanic=sns.load_dataset('titanic')
print(titanic)

     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alo

In [11]:
# GROUP BY:
# groupby() is a function that allows you to split your data into groups based on the values in one or more columns, and then perform operations (such as sum, mean, count, etc.) on each of those groups.

# **`groupby()`** is used to:
# - **Split** data into groups based on the values in one or more columns.
# - **Perform operations** (like `sum()`, `mean()`, `count()`, etc.) on each of those groups.

# ### In summary:
# You can think of it as:
# 1. **Group** the data based on column values.
# 2. **Apply** a function (like sum, mean) to each group.
# 3. **Get results** for each group.

# EXAMPLE:
titanic=sns.load_dataset('titanic')
print(titanic)

# What fraction of males and of females survived? (group by sex, then take the mean of survived)
res=titanic.groupby('sex')['survived'].mean()
print(res)

# What fraction survived for each combination of sex and class?
res=titanic.groupby(['sex','class'])['survived'].mean()
print(res)

res=titanic.groupby('sex')['survived'].count()
print(res)

     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
888         0       3  female   NaN      1      2  23.4500        S   Third   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alo
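

In [None]:
# A minimal sketch of computing several summaries per group in one call with agg(). It uses a tiny hand-made table instead of the titanic dataset so it runs without downloading anything; the values and the names passengers and summary are made up for illustration.

```python
import pandas as pd

passengers = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female', 'male'],
    'fare': [10.0, 20.0, 30.0, 50.0, 70.0],
    'survived': [1, 0, 1, 1, 0]})

# Split by sex, then apply one function per output column
summary = passengers.groupby('sex').agg(
    mean_fare=('fare', 'mean'),       # average fare in the group
    survivors=('survived', 'sum'),    # number of 1s in the group
    total=('survived', 'count'))      # group size
print(summary)
```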



In [25]:
# PIVOT TABLES:
# Pivot tables in Pandas group a dataset by one or more keys and summarize it in a tabular format. Values are aggregated with functions such as sum, mean, or count (mean by default), which makes trends, patterns, and relationships in the data easier to analyze.
#EXAMPLE:
res=titanic.pivot_table('survived',index='sex',columns='class')
print(res)


# I WANT TO FIND SUM
# Aggregations are named as strings here; passing np.sum or np.mean triggers a FutureWarning in recent pandas versions
res=titanic.pivot_table('survived',index='sex',columns='class',aggfunc='sum')
print(res)

# I WANT TO FIND MEAN
res=titanic.pivot_table('survived',index='sex',columns='class',aggfunc='mean')
print(res)

class      First    Second     Third
sex                                 
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
class   First  Second  Third
sex                         
female     91      70     72
male       45      17     47
class      First    Second     Third
sex                                 
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
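

In [None]:
# A short sketch showing that a pivot table gives the same numbers as groupby followed by unstack(), and that margins=True adds 'All' totals. It uses a tiny hand-made table, not the titanic dataset; the values and the names df, pivot, grouped and with_totals are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 0, 0, 1, 1]})

# Pivot table: rows = sex, columns = class, cells = mean of survived
pivot = df.pivot_table('survived', index='sex', columns='class', aggfunc='mean')
print(pivot)

# The same table built with groupby, then moving class into the columns
grouped = df.groupby(['sex', 'class'])['survived'].mean().unstack()
print(grouped)

# margins=True appends row and column totals labelled 'All'
with_totals = df.pivot_table('survived', index='sex', columns='class',
                             aggfunc='sum', margins=True)
print(with_totals)
```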




  student branch
0     Ram    ECE
1   Vinay    EEE
2  Dheera    EEE
3   Varun    CSE
  student  marks
0     Ram     40
1   Vinay     50
2  Dheera     60
3   Varun     80
4    rana     90
5     sai    100
  student branch  marks
0     Ram    ECE     40
1   Vinay    EEE     50
2  Dheera    EEE     60
3   Varun    CSE     80
