### Pandas Library

* Pandas is a powerful and popular Python library for data manipulation, analysis, and exploration. It is useful for working with structured data such as tables or time series, and it provides data structures and functions that make these tasks easier and more efficient.

* Pandas lets us analyze large data sets and draw conclusions using statistical methods.
It can also clean messy data sets so that they become readable and relevant.
This matters because data science depends on relevant data.

***Import Pandas Libraries***

* To do anything with pandas we first need to import it. By convention it is imported under the alias pd.

In [1]:
import pandas as pd

***Creation of Pandas Series & dataframe***


***DataFrame***

* The code below creates a Pandas DataFrame object using the "pd.DataFrame()" constructor.

* A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [2]:
mydataset = {'cars':["BMW","Volvo","Ford"],
             'passings':[3,7,2]}
myvar = pd.DataFrame(mydataset)
print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [3]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

   calories  duration
0       420        50
1       380        40
2       390        45
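
* To make the "table with rows and columns" description concrete, the sketch below inspects the parts of the calories/duration DataFrame from the cell above. The variable names (column_names, row_labels, n_rows, n_cols) are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

# Column labels, row labels, and the (rows, columns) shape
column_names = list(df.columns)
row_labels = list(df.index)
n_rows, n_cols = df.shape

# Selecting one column returns a Series
calories = df["calories"]

print(column_names)
print(row_labels)
print(n_rows, n_cols)
print(type(calories).__name__)
```

Each column of a DataFrame is a Series, and all columns share the same row index.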


***Formation of different DataFrame from different datatypes***

* From a Dictionary of Lists or Arrays

In [34]:
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 40], 'City' : ['Hyderabad','Delhi', 'Chennai']}
df = pd.DataFrame(data)
print(df)

      Name  Age       City
0    Alice   25  Hyderabad
1      Bob   30      Delhi
2  Charlie   40    Chennai


*  From a List of Dictionaries

In [35]:
data = [{'Name': 'Avanthi','age' : 20, 'city' : 'Ahemdabad'},
        {'Name': 'Bhaskar', 'age' : 15, 'city' : 'Banglore'},
        {'Name': 'Charan', 'age' : 35, 'city' : 'Chennai'}]
df = pd.DataFrame(data)
print(df)

      Name  age       city
0  Avanthi   20  Ahemdabad
1  Bhaskar   15   Banglore
2   Charan   35    Chennai


* From a List of Lists or Tuples (with Column Names)

In [36]:
data = [[1, 'Aishu'], [2, 'Bhargav'], [3, 'Chandu']]
df = pd.DataFrame(data, columns = ['ID', 'Name'])
print(df)

   ID     Name
0   1    Aishu
1   2  Bhargav
2   3   Chandu


* From a NumPy Array (with Column Names)

In [37]:
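import numpy as np  # np must be imported before np.array is used

# A NumPy array holds a single dtype, so the integers are stored as strings here
data = np.array([[1, 'Alice'], [2, 'Bob'], [3, 'Charlie']])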
df = pd.DataFrame(data, columns=['ID', 'Name'])
print(df)

  ID     Name
0  1    Alice
1  2      Bob
2  3  Charlie
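
* One thing to watch in the NumPy example above: a NumPy array has a single dtype, so mixing integers and text turns every value into a string, including the IDs. The sketch below shows one way to check this and convert the column back, assuming the IDs are always valid integers.

```python
import numpy as np
import pandas as pd

data = np.array([[1, "Alice"], [2, "Bob"], [3, "Charlie"]])
df = pd.DataFrame(data, columns=["ID", "Name"])

# The ID values arrived as text, not as integers
first_id = df["ID"].iloc[0]
print(isinstance(first_id, str))

# Convert the column back to integers explicitly
df["ID"] = df["ID"].astype(int)
print(df["ID"].sum())
```

Building the DataFrame from a list of lists or a dictionary, as in the earlier cells, avoids the problem because each column keeps its own type.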


* From a Series (as Columns)

In [38]:
series1 = pd.Series([1, 2, 3], name='ID')
series2 = pd.Series(['Alice', 'Bob', 'Charlie'], name='Name')
df = pd.DataFrame({series1.name: series1, series2.name: series2})
print(df)

   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie


* From a List of Tuples (with Named Columns)

In [39]:
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
df = pd.DataFrame(data, columns=['ID', 'Name'])
print(df)

   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie


* Creating DataFrames with Index and Custom Column Names

In [40]:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
index = ['A', 'B', 'C']
columns = ['X', 'Y', 'Z']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)

   X  Y  Z
A  1  2  3
B  4  5  6
C  7  8  9


* Creating an Empty DataFrame

In [42]:
df = pd.DataFrame(columns=['A', 'B', 'C'])
print(df)

Empty DataFrame
Columns: [A, B, C]
Index: []


* DataFrame with NaN values

In [43]:
import numpy as np
df = pd.DataFrame([[np.nan, 2, 3], [4, np.nan, 6], [7, 8, 9]], columns=['A', 'B', 'C'])
print(df)

     A    B  C
0  NaN  2.0  3
1  4.0  NaN  6
2  7.0  8.0  9


* Creating DataFrames with MultiIndex (Hierarchical Indexing)

In [44]:
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
index = pd.MultiIndex.from_arrays(arrays, names=('Letters', 'Numbers'))
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], index=index, columns=['Col1', 'Col2'])
print(df)

                 Col1  Col2
Letters Numbers            
A       1           1     2
        2           3     4
B       1           5     6
        2           7     8
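
* Once a MultiIndex is in place, rows can be selected by the outer label, by a full (outer, inner) pair, or by an inner label alone. The following is a small sketch on the same data; the names rows_a, cell, and ones are illustrative.

```python
import pandas as pd

arrays = [["A", "A", "B", "B"], [1, 2, 1, 2]]
index = pd.MultiIndex.from_arrays(arrays, names=("Letters", "Numbers"))
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                  index=index, columns=["Col1", "Col2"])

rows_a = df.loc["A"]              # all rows under the outer label 'A'
cell = df.loc[("B", 2), "Col1"]   # one cell: outer 'B', inner 2
ones = df.xs(1, level="Numbers")  # inner label 1 for every letter

print(rows_a)
print(cell)
print(ones)
```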


***Series***

* The code below creates a Pandas Series object using the "pd.Series()" constructor.

* A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.

In [4]:
a = [1,7,2]
df = pd.Series(a)
print(df)

0    1
1    7
2    2
dtype: int64


In [5]:
a = [1,2,3]
df = pd.Series(a, index = ["X", "Y", "Z"])
print(df)

X    1
Y    2
Z    3
dtype: int64
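
* A Series bundles three things: the values, the index labels, and a dtype (plus an optional name). The sketch below pulls them apart for the X/Y/Z Series above; the name "scores" is a made-up label used only for this illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["X", "Y", "Z"], name="scores")

labels = list(s.index)   # the index labels
values = s.to_numpy()    # the underlying values as a NumPy array

print(labels)
print(values)
print(s.dtype)
print(s.name)
```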


In [6]:
pds = pd.Series(list("abcdef"),range(0,6,1))
pds

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

In [7]:
pds5=pd.Series(list("7896"),range(4,8))
pds5

4    7
5    8
6    9
7    6
dtype: object

***Formation of series from different data types***

* From a List

In [29]:
data = [1,2,3,4,5]
series = pd.Series(data)
print(series)

0    1
1    2
2    3
3    4
4    5
dtype: int64


* From a Dictionary

In [30]:
data = {"a":1, "b": 2, "c":3}
series = pd.Series(data)
print(series)

a    1
b    2
c    3
dtype: int64


* From a NumPy Array


In [31]:
import numpy as np
data = np.array([10, 20, 30, 40])
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
dtype: int64


* From a Tuple



In [32]:
data = (1,2,3,4)
series = pd.Series(data)
print(series)

0    1
1    2
2    3
3    4
dtype: int64


* With a Specific Index

In [33]:
data = [100, 200, 300]
index = ['X', 'Y', 'Z']
series = pd.Series(data, index = index)
print(series)

X    100
Y    200
Z    300
dtype: int64


***Accessing elements of Series***

Pandas provides several ways to access the elements of a Series.

* Accessing by Index Position (Using Integer Indexing)

In [53]:
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

print(series.iloc[0])
print("-----------------")
print(series.iloc[2])
print("-----------------")
print(series.iloc[1:3])
print("-----------------")
print(series.iloc[2:])
print("-----------------")
print(series.iloc[:4])
print("-----------------")
print(series.iloc[::1])
print(series.iloc[::3])
print("-----------------")



0    10
1    20
2    30
3    40
4    50
dtype: int64
10
-----------------
30
-----------------
1    20
2    30
dtype: int64
-----------------
2    30
3    40
4    50
dtype: int64
-----------------
0    10
1    20
2    30
3    40
dtype: int64
-----------------
0    10
1    20
2    30
3    40
4    50
dtype: int64
0    10
3    40
dtype: int64
-----------------


In [52]:
print(series.iloc[2::])

2    30
3    40
4    50
dtype: int64


* Using Direct Indexing (for 0-based index)

In [57]:
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)
print("---------------")
print(series[0])
print("---------------")
print(series[3])
print("---------------")
print(series[0:3])

0    10
1    20
2    30
3    40
4    50
dtype: int64
---------------
10
---------------
40
---------------
0    10
1    20
2    30
dtype: int64
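
* Direct indexing with an integer, such as series[0], looks the value up by label. With the default 0-based index the label and the position happen to coincide, which is why the cell above works. The sketch below uses a deliberately shuffled integer index (an assumption made for illustration) to show the difference.

```python
import pandas as pd

# Integer labels that do not match the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s[0]          # label 0, which sits at position 1
by_label_loc = s.loc[0]  # the same lookup, stated explicitly
by_position = s.iloc[0]  # the first element, whatever its label

print(by_label, by_label_loc, by_position)
```

Using .loc for labels and .iloc for positions keeps the intent unambiguous.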


* Accessing by Index Label (Using Label-based Indexing)

In [60]:
data = [100, 150, 200, 250, 300, 350, 400, 450, 500]
index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
series = pd.Series(data, index = index)

print(series)
print("---------------")
print(series.loc['a'])
print("---------------")
print(series.loc['b'])
print("---------------")
print(series.loc['h'])

a    100
b    150
c    200
d    250
e    300
f    350
g    400
h    450
i    500
dtype: int64
---------------
100
---------------
150
---------------
450


* Using Direct Label Indexing

In [65]:
print(series['c'])
print("----------------")
print(series['i'])
print("----------------")
print(series['a':'g'])
print("----------------")
print(series['g':])
print("----------------")
print(series[:'d'])

200
----------------
500
----------------
a    100
b    150
c    200
d    250
e    300
f    350
g    400
dtype: int64
----------------
g    400
h    450
i    500
dtype: int64
----------------
a    100
b    150
c    200
d    250
dtype: int64


* Accessing Multiple Elements:

a. Using .iloc[ ] and .loc[ ] for Multiple Index Positions

In [69]:
print(series.iloc[[0, 2]])
print("-----------------")
print(series.loc[['a', 'd']])

a    100
c    200
dtype: int64
-----------------
a    100
d    250
dtype: int64


b. Using Boolean Indexing 

In [70]:
print(series[series > 20])

a    100
b    150
c    200
d    250
e    300
f    350
g    400
h    450
i    500
dtype: int64
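
* In the cell above every value is greater than 20, so the filter keeps the whole Series. The sketch below uses thresholds that actually exclude some rows and combines conditions with & (and) and | (or). The variable names are illustrative.

```python
import pandas as pd

series = pd.Series([100, 150, 200, 250, 300, 350, 400, 450, 500],
                   index=list("abcdefghi"))

mask = series > 300        # a boolean Series, one True/False per element
above_300 = series[mask]

# Each condition needs its own parentheses when combined
mid_range = series[(series > 200) & (series < 400)]
extremes = series[(series < 150) | (series > 450)]

print(above_300)
print(mid_range)
print(extremes)
```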


***Operation on Series***

In Pandas we can perform element-wise operations that combine or manipulate Series.

In [11]:
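# pds1 is not defined in any earlier cell; it is reconstructed here from the output below
pds1 = pd.Series(list("lmnopq"), range(0, 6, 1))
pds+pds1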

0    al
1    bm
2    cn
3    do
4    ep
5    fq
dtype: object

In [12]:
index_labels = ['a', 'b', 'c', 'd', 'e']
data_string = ['apple', 'banana', 'cherry', 'date', 'elderberry']
n = pd.Series(index_labels)
s = pd.Series(data_string)
print(n)
print(s)
print(n+s)

0    a
1    b
2    c
3    d
4    e
dtype: object
0         apple
1        banana
2        cherry
3          date
4    elderberry
dtype: object
0         aapple
1        bbanana
2        ccherry
3          ddate
4    eelderberry
dtype: object
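
* The same element-wise behaviour applies to numbers, and Pandas matches elements by index label rather than by position. The following sketch, with made-up labels and values, shows what happens when the two indexes only partly overlap.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

doubled = a * 2                   # a scalar is applied to every element
total = a + b                     # labels present in only one Series give NaN
filled = a.add(b, fill_value=0)   # treat a missing side as 0 instead

print(doubled)
print(total)
print(filled)
```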


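* In the code below, df2 is a dictionary whose keys "alfa" and "beta" map to the Series pds4 and pds5, respectively.
pd_df2 = pd.DataFrame(df2) creates a Pandas DataFrame from that dictionary. The two Series are aligned on their index labels, so rows that exist in only one Series are filled with NaN.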

In [13]:
pds4=pd.Series(["blue","red","green","violet","white","black"],range(0,6,1))
pds5=pd.Series(list("7896"),range(4,8))
df2={"alfa":pds4,"beta":pds5}
pd_df2=pd.DataFrame(df2)
pd_df2

     alfa  beta
0    blue   NaN
1     red   NaN
2   green   NaN
3  violet   NaN
4   white     7
5   black     8
6     NaN     9
7     NaN     6


***Accessing a particular column***

In the code below, the key "alfa" is assigned to the Series pds4, and that key can then be used to access the column of the resulting DataFrame.

In [14]:
pds4=pd.Series(["blue","red","green","violet","white","black"],range(0,6,1))
pds5=pd.Series(list("7896"),range(4,8))
df2={"alfa":pds4,"beta":pds5}
pd_df2=pd.DataFrame(df2)
pd_df2["alfa"]

0      blue
1       red
2     green
3    violet
4     white
5     black
6       NaN
7       NaN
Name: alfa, dtype: object

In [15]:
categories = ['small', 'medium', 'large', 'medium', 'small']
series_categories = pd.Series(categories)
print(series_categories)
print(series_categories[0])
print(series_categories[4])
print(series_categories[3])

0     small
1    medium
2     large
3    medium
4     small
dtype: object
small
small
medium


***Accessing a group of particular columns***

*  The code creates a DataFrame from the dictionary df2, from which specific columns or cells can then be extracted. This demonstrates how to access data in Pandas DataFrames.

In [16]:
pds4=pd.Series(["blue","red","green","violet","white","black"],range(0,6,1))
pds5=pd.Series(list("7896"),range(4,8))
df2={"alfa":pds4,"beta":pds5}
pd_df2=pd.DataFrame(df2)
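
* To select a group of columns, pass a list of column names inside the brackets. The sketch below rebuilds the same DataFrame so that it runs on its own; the names both and overlap are illustrative.

```python
import pandas as pd

pds4 = pd.Series(["blue", "red", "green", "violet", "white", "black"], range(0, 6))
pds5 = pd.Series(list("7896"), range(4, 8))
pd_df2 = pd.DataFrame({"alfa": pds4, "beta": pds5})

# A list of column names returns a DataFrame with just those columns
both = pd_df2[["alfa", "beta"]]

# .loc combines a row-label slice (both ends included) with a column list
overlap = pd_df2.loc[4:5, ["alfa", "beta"]]

print(both.shape)
print(overlap)
```

A single name in single brackets returns a Series, while a list of names returns a DataFrame even if the list has only one entry.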

In [17]:
# df.loc[row_label, column_label]
pd_df2.loc[7,"beta"]

'6'

***Accessing a particular cell using index and column names***

The ".loc" indexer in Pandas is used for label-based indexing: it accesses a group of rows and columns by labels or by a boolean array. It provides a way to select and modify specific parts of a DataFrame or Series.

In [18]:
area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
res=pd.Series(area)
print(res)

california    423967
Texas         695662
Newyork       141297
Florida       170312
Illinois      149995
dtype: int64


In [19]:
print(res.loc['california'])
print(res.loc[['california','Texas']])
print(res.loc['Texas':'Florida'])

423967
california    423967
Texas         695662
dtype: int64
Texas      695662
Newyork    141297
Florida    170312
dtype: int64


In [20]:
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)

# Select row by label 'b'
print(df.loc['b'])
print(df.loc['a'])


      Name  Age
a    Alice   25
b      Bob   30
c  Charlie   35
Name    Bob
Age      30
Name: b, dtype: object
Name    Alice
Age        25
Name: a, dtype: object


In [21]:
area={'california':423967,'Texas':695662,'Newyork':141297,'Florida':170312,'Illinois':149995}
pop={'california':3833251,'Texas':26448193,'Newyork':19651127,'Florida':19552860,'Illinois':12882135}
res=pd.DataFrame({'area':area,'population':pop})
print(res)

              area  population
california  423967     3833251
Texas       695662    26448193
Newyork     141297    19651127
Florida     170312    19552860
Illinois    149995    12882135


In [22]:
print(res.loc['california'])
print(res.loc['Newyork'])
print(res.loc)  # without [...] this prints the indexer object itself, not any data

area           423967
population    3833251
Name: california, dtype: int64
area            141297
population    19651127
Name: Newyork, dtype: int64
<pandas.core.indexing._LocIndexer object at 0x0000020F48965EA0>


In [23]:
print(res.loc["california","area"])

423967
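
* The same cell can be reached by label with .loc or by position with .iloc, and .loc also accepts a boolean condition for the rows. The sketch below reuses the area/population data from the cells above; the 150000 threshold is an arbitrary value picked for illustration.

```python
import pandas as pd

area = {"california": 423967, "Texas": 695662, "Newyork": 141297,
        "Florida": 170312, "Illinois": 149995}
pop = {"california": 3833251, "Texas": 26448193, "Newyork": 19651127,
       "Florida": 19552860, "Illinois": 12882135}
res = pd.DataFrame({"area": area, "population": pop})

by_label = res.loc["Texas", "population"]   # row label, column label
by_position = res.iloc[1, 1]                # second row, second column

# Boolean row selection: population of the states with area above 150000
large = res.loc[res["area"] > 150000, "population"]

print(by_label, by_position)
print(large)
```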


***Adding a new column to a dataframe***

In [26]:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
df['C'] = 10
print(df)

   A  B   C
0  1  4  10
1  2  5  10
2  3  6  10


In [27]:
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6]})
df['C'] = [12,20,30]
print(df)


   A  B   C
0  1  4  12
1  2  5  20
2  3  6  30
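
* A new column can also be computed from existing columns. The sketch below shows direct assignment and the assign() method, which returns a new DataFrame and leaves the original unchanged. The column name D and the variable df_with_d are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Element-wise sum of two existing columns, stored in place
df["C"] = df["A"] + df["B"]

# assign() returns a copy with the extra column
df_with_d = df.assign(D=lambda d: d["A"] * 2)

print(df)
print(df_with_d)
```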


***Removing a particular column from dataframe***

In [73]:
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

print(df)
print("----------------")
df = df.drop('Age',axis = 1)
print(df)


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
----------------
      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago


In [76]:
# Removing 'Age' and 'City' columns
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
df = df.drop(['Age', 'City'], axis=1)
print(df)


      Name
0    Alice
1      Bob
2  Charlie


In [78]:
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
df.drop('City', axis = 1, inplace = True)
print(df)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


***Removing a particular row***

In [79]:
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
df = df.drop(1)
print(df)

      Name  Age      City
0    Alice   25  New York
2  Charlie   35   Chicago


In [80]:
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
df = df.drop([0,2])
print(df)

  Name  Age         City
1  Bob   30  Los Angeles


In [81]:
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
df.drop(1, axis=0, inplace=True)
print(df)

      Name  Age      City
0    Alice   25  New York
2  Charlie   35   Chicago
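
* After drop() the remaining rows keep their original labels (0 and 2 above). Rows are also often removed by a condition rather than by label. The sketch below keeps only the rows that satisfy a condition and then renumbers the index; the age cutoff of 30 is an arbitrary choice for illustration.

```python
import pandas as pd

data = {"Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "City": ["New York", "Los Angeles", "Chicago"]}
df = pd.DataFrame(data)

# Keep rows where Age is at least 30, then renumber the index from 0
kept = df[df["Age"] >= 30].reset_index(drop=True)

print(kept)
```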


***Getting the dimension of a dataframe***

In [82]:
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
df.shape


(3, 3)

***Filling nulls with a specified value***

In [83]:
data = [10, np.nan, 30, np.nan, 50]
series = pd.Series(data)

# Fill NaN values with a specific value (e.g., 0)
filled_series = series.fillna(0)
print(filled_series)

0    10.0
1     0.0
2    30.0
3     0.0
4    50.0
dtype: float64
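
* A constant is only one possible replacement. The sketch below, on the same Series, counts the missing values and shows three other common options: filling with the mean, carrying the previous value forward, and dropping the missing entries. Which one is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

series = pd.Series([10, np.nan, 30, np.nan, 50])

n_missing = series.isna().sum()              # how many values are missing
mean_filled = series.fillna(series.mean())   # mean ignores NaN by default
forward_filled = series.ffill()              # repeat the last valid value
dropped = series.dropna()                    # remove the missing entries

print(n_missing)
print(mean_filled)
print(forward_filled)
print(dropped)
```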


***Grouping based on column values and applying an aggregation or transformation to other columns***

In [84]:
import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [10, 20, 30, 40, 50, 60],
    'Value2': [5, 10, 15, 20, 25, 30]
}
df = pd.DataFrame(data)

grouped_df = df.groupby('Category').agg({'Value1': 'sum', 'Value2': 'sum'})
print(grouped_df)


          Value1  Value2
Category                
A             90      45
B            120      60


In [85]:
df['Value1_zscore'] = df.groupby('Category')['Value1'].transform(lambda x: (x - x.mean()) / x.std())
print(df)


  Category  Value1  Value2  Value1_zscore
0        A      10       5           -1.0
1        B      20      10           -1.0
2        A      30      15            0.0
3        B      40      20            0.0
4        A      50      25            1.0
5        B      60      30            1.0
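
* Note that x.std() in the cell above is the sample standard deviation (ddof=1 by default), which is why each group's z-scores come out as exactly -1, 0, and 1. The sketch below shows two further aggregation styles on the same data: several functions on one column, and named aggregation. The output column names total and avg_value2 are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A", "B"],
    "Value1": [10, 20, 30, 40, 50, 60],
    "Value2": [5, 10, 15, 20, 25, 30],
})

# Several aggregations of one column, one result column per function
summary = df.groupby("Category")["Value1"].agg(["mean", "min", "max"])

# Named aggregation: output_name=(input_column, function)
named = df.groupby("Category").agg(
    total=("Value1", "sum"),
    avg_value2=("Value2", "mean"),
)

print(summary)
print(named)
```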


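***Describing characteristics of a column with categorical data***

In [86]:
# describe is a method, so it needs parentheses; without them the notebook
# only displays the bound method object instead of a summary.
# For a text column it reports count, unique, top, and freq
# (here: 6 values, 2 unique, with A and B occurring 3 times each).
df['Category'].describe()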

***Dataframe concatenation***

In [87]:
df1 = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

df2 = pd.DataFrame({
    'A': [5, 6],
    'B': [7, 8]
})

df_concat = pd.concat([df1, df2], axis=0)
print(df_concat)


   A  B
0  1  3
1  2  4
0  5  7
1  6  8


In [88]:
df_concat = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df_concat)

   A  B
0  1  3
1  2  4
2  5  7
3  6  8


In [89]:
df1 = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

df2 = pd.DataFrame({
    'C': [5, 6],
    'D': [7, 8]
})

df_concat = pd.concat([df1, df2], axis=1)
print(df_concat)


   A  B  C  D
0  1  3  5  7
1  2  4  6  8


In [90]:
df1 = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

df2 = pd.DataFrame({
    'B': [5, 6],
    'C': [7, 8]
})

df_concat = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df_concat)


     A  B    C
0  1.0  3  NaN
1  2.0  4  NaN
2  NaN  5  7.0
3  NaN  6  8.0
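
* When the column sets differ, the default outer join keeps every column and fills the gaps with NaN, as in the last cell. The sketch below shows two optional parameters of pd.concat on the same frames: join="inner" keeps only the shared columns, and keys labels the rows by their source. The key names "first" and "second" are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Keep only the columns present in both frames
inner = pd.concat([df1, df2], join="inner", ignore_index=True)

# Tag each block of rows with the frame it came from
labeled = pd.concat([df1, df2], keys=["first", "second"])
second_rows = labeled.loc["second"]

print(inner)
print(labeled)
```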


### ***Exercise***

In [1]:
import pandas as pd 

df = pd.Series([5,15,25,35,45,55])
print(df)


0     5
1    15
2    25
3    35
4    45
5    55
dtype: int64


In [2]:
df1 = pd.Series([10,20,30,40,50,60])
print(df1)

0    10
1    20
2    30
3    40
4    50
5    60
dtype: int64


In [4]:
print(df.loc[2])

25


In [11]:
df2 = pd.DataFrame({"Column1": df, "Column2": df1})
print(df2)

   Column1  Column2
0        5       10
1       15       20
2       25       30
3       35       40
4       45       50
5       55       60


In [None]:
# df is a Series here and has no columns; the two-column DataFrame built above is df2
axis = df2.loc[1,"Column1"]
print(axis)

In [15]:
import pandas as pd

data = {
    "four": ["blue", "red", "green", "white", "black"],
    "one": ["a", "b", "c", "e", "f"],
    "three": ["apple", "orange", "banana", "melon", "pineapple"],
    "two": ["l", "m", "n", "p", "q"],
    "ten": [10, 20, 30, 50, 60]
}

df = pd.DataFrame(data)

sum_of_ten = df["ten"].sum()
print("Sum of all items in column 'ten':", sum_of_ten)

fruits = df["three"].tolist()
print("\nAll fruits in the DataFrame:")
print(fruits)

column_ten_description = df["ten"].describe()
print(column_ten_description)

Sum of all items in column 'ten': 170

All fruits in the DataFrame:
['apple', 'orange', 'banana', 'melon', 'pineapple']
count     5.000000
mean     34.000000
std      20.736441
min      10.000000
25%      20.000000
50%      30.000000
75%      50.000000
max      60.000000
Name: ten, dtype: float64
