**Pandas-DataFrame and Series**

Pandas is a powerful data manipulation library for Python, widely used for data analysis and data cleaning. It provides two primary data structures: Series and DataFrame.
A Series is a one-dimensional array-like object, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In [77]:
import pandas as pd

Series

A Series is a one-dimensional array-like object that can hold any data type. It is similar to a column in a table.

In [78]:
import pandas as pd

#Create a Series from a list
data=[1,2,3,4,5]
series=pd.Series(data)
print(f"Series: \n {series}")
print(series[2])
print(type(series))

Series: 
 0    1
1    2
2    3
3    4
4    5
dtype: int64
3
<class 'pandas.core.series.Series'>


In [79]:
#Create a Series from a dictionary
dictionary={'a':1,'b':2,'c':3,'d':4,'e':5}
seriesDictionary=pd.Series(dictionary)
print(seriesDictionary)
print(type(seriesDictionary))
print(seriesDictionary['d'])

a    1
b    2
c    3
d    4
e    5
dtype: int64
<class 'pandas.core.series.Series'>
4


In [80]:
#Create a Series from a tuple
tuple2=(10,20,30)
seriesTuple=pd.Series(tuple2)
seriesTuple

0    10
1    20
2    30
dtype: int64
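
The "array-like" description can be made concrete with a short sketch. It reuses the list from the first Series cell; the variable names `values`, `labels`, `doubled` and `greater_than_three` are only illustrative.

```python
import pandas as pd

# Same values as in the first Series cell
series = pd.Series([1, 2, 3, 4, 5])

# A Series pairs an array of values with an index of labels
values = series.to_numpy()      # underlying NumPy array
labels = list(series.index)     # default labels are 0..n-1

# Arithmetic is applied element-wise, as with NumPy arrays
doubled = series * 2

# A boolean mask selects the matching elements and keeps their labels
greater_than_three = series[series > 3]

print(labels)
print(doubled.tolist())
print(greater_than_three)
```

Because the labels travel with the values, the filtered result still reports positions 3 and 4 from the original Series.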

In [81]:
#Use custom index labels instead of the default integer positions
data=[10,20,30]
index=['a','b','c']
indexSeries=pd.Series(data,index=index)
print(indexSeries)

a    10
b    20
c    30
dtype: int64


In [82]:
#Pitfall: passing an index whose labels do not match the dictionary keys
dictionary={'a':4,'b':5,'c':6}
index=['x','y','z']
wrongIndexSeries=pd.Series(dictionary,index)
wrongIndexSeries # All values are NaN: for a dictionary, pandas looks up each index label among the keys, and 'x', 'y', 'z' match none of them.

x   NaN
y   NaN
z   NaN
dtype: float64
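
If the goal is to relabel the values rather than to select by key, one possible approach is sketched here. The names `partial`, `relabeled` and `from_values` are illustrative and not part of the original notebook.

```python
import pandas as pd

dictionary = {'a': 4, 'b': 5, 'c': 6}

# Labels that match the keys select (and order) the values;
# labels missing from the dictionary become NaN
partial = pd.Series(dictionary, index=['c', 'a', 'x'])

# To relabel instead, build the Series first and then replace its index
relabeled = pd.Series(dictionary)
relabeled.index = ['x', 'y', 'z']

# Alternatively, pass only the values so there are no keys to match against
from_values = pd.Series(list(dictionary.values()), index=['x', 'y', 'z'])

print(partial)
print(relabeled)
```

Note that `partial` is stored as float64, because NaN cannot be represented in an integer column.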

DataFrame


In [83]:
#Create a DataFrame from a dictionary of lists
data={
"Name":["Krish","John","Guevara"],
"Age":[25,30,45],
"City":["Pretoria","New York","Mumbai"]
}

print(data,"\n")

dataFrame=pd.DataFrame(data)
print(dataFrame,"\n")
print(type(dataFrame))

{'Name': ['Krish', 'John', 'Guevara'], 'Age': [25, 30, 45], 'City': ['Pretoria', 'New York', 'Mumbai']} 

      Name  Age      City
0    Krish   25  Pretoria
1     John   30  New York
2  Guevara   45    Mumbai 

<class 'pandas.core.frame.DataFrame'>


In [84]:
#Convert the DataFrame into a NumPy array
import numpy as np

array=np.array(dataFrame)
array

array([['Krish', 25, 'Pretoria'],
       ['John', 30, 'New York'],
       ['Guevara', 45, 'Mumbai']], dtype=object)
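
The `dtype=object` in the output comes from mixing text and numbers in one array. As a hedged sketch, the `to_numpy()` method gives the same conversion, and selecting only numeric data keeps a numeric dtype. The names `mixed` and `ages` are illustrative.

```python
import pandas as pd

dataFrame = pd.DataFrame({
    "Name": ["Krish", "John", "Guevara"],
    "Age": [25, 30, 45],
    "City": ["Pretoria", "New York", "Mumbai"],
})

# Mixed column types force a generic object array
mixed = dataFrame.to_numpy()

# A purely numeric selection keeps a numeric dtype
ages = dataFrame["Age"].to_numpy()

print(mixed.dtype)
print(ages.dtype, ages)
```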

In [85]:
#Create a DataFrame from a list of dictionaries
data=[
{"name":"Krish", "age":25,"city":"Pretoria"},
{"name":"John", "age":30,"city":"New York"},
{"name":"Guevara", "age":45,"city":"Mumbai"}
]

dataFrame=pd.DataFrame(data)
print(dataFrame)
print(type(dataFrame))


      name  age      city
0    Krish   25  Pretoria
1     John   30  New York
2  Guevara   45    Mumbai
<class 'pandas.core.frame.DataFrame'>
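
With a list of dictionaries, the records do not need to share the same keys. The following sketch uses hypothetical records (the second has no "city" key) to show what happens in that case.

```python
import pandas as pd

# Hypothetical records: the second one has no "city" key
records = [
    {"name": "Krish", "age": 25, "city": "Pretoria"},
    {"name": "John", "age": 30},
]

frame = pd.DataFrame(records)

# Columns are the union of all keys; missing entries are filled with NaN
print(frame)
print(frame["city"].isna().tolist())
```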


Reading a DataFrame

In [86]:
#From a CSV file

dataFrameSales=pd.read_csv("salesData.csv")
print(dataFrameSales.head())
print("")
print(dataFrameSales.tail())



         Date Category  Value   Product  Sales Region
0  2023-01-01        A   28.0  Product1  754.0   East
1  2023-01-02        B   39.0  Product3  110.0  North
2  2023-01-03        C   32.0  Product2  398.0   East
3  2023-01-04        B    8.0  Product1  522.0   East
4  2023-01-05        B   26.0  Product3  869.0  North

          Date Category  Value   Product  Sales Region
45  2023-02-15        B   99.0  Product2  599.0   West
46  2023-02-16        B    6.0  Product1  938.0  South
47  2023-02-17        B   69.0  Product3  143.0   West
48  2023-02-18        C   65.0  Product3  182.0  North
49  2023-02-19        C   11.0  Product3  708.0  North
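
Since `salesData.csv` is not included here, the sketch below reads the same layout from an in-memory string. The three rows are copied from the `head()` output; `csv_text` and `sample` are illustrative names. It also shows `parse_dates`, which may be useful because a plain `read_csv` leaves date columns as text.

```python
import io
import pandas as pd

# Inline stand-in for salesData.csv (first rows copied from the output above)
csv_text = """Date,Category,Value,Product,Sales,Region
2023-01-01,A,28.0,Product1,754.0,East
2023-01-02,B,39.0,Product3,110.0,North
2023-01-03,C,32.0,Product2,398.0,East
"""

# parse_dates converts the Date column from text to datetime64
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sample.dtypes)
print(sample.head(2))   # head/tail accept the number of rows to show
```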


In [87]:
# Accessing data from the DataFrame
#dataFrameSales["Product"].unique()
products=set(dataFrameSales["Product"]) # Collect the distinct product names
print(products)

type(dataFrameSales["Product"]) # Selecting a single column returns a Series

{'Product3', 'Product2', 'Product1'}


pandas.core.series.Series
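
A few related ways of inspecting columns, sketched on a small stand-in for `dataFrameSales` built from the rows shown in the `head()` output. The name `sales` is illustrative.

```python
import pandas as pd

# Small stand-in for dataFrameSales (rows copied from the head() output)
sales = pd.DataFrame({
    "Product": ["Product1", "Product3", "Product2", "Product1", "Product3"],
    "Sales": [754.0, 110.0, 398.0, 522.0, 869.0],
    "Region": ["East", "North", "East", "East", "North"],
})

# unique() keeps the order of first appearance, unlike set()
print(sales["Product"].unique())

# value_counts() reports how often each value occurs
counts = sales["Product"].value_counts()
print(counts)

# A list of column names returns a DataFrame rather than a Series
subset = sales[["Product", "Sales"]]
print(type(subset))
```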

In [88]:
# loc: select a row by its index label
dataFrameSales.loc[0]

Date        2023-01-01
Category             A
Value             28.0
Product       Product1
Sales            754.0
Region            East
Name: 0, dtype: object

In [89]:
# iloc: select by integer position (row 0, column 2)
dataFrameSales.iloc[0, 2]


np.float64(28.0)
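
With the default index, labels and positions coincide, so `loc` and `iloc` look interchangeable. The sketch below uses a hypothetical frame whose index labels are 10, 20 and 30 to make the difference visible.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
frame = pd.DataFrame(
    {"Value": [28.0, 39.0, 32.0], "Sales": [754.0, 110.0, 398.0]},
    index=[10, 20, 30],
)

by_label = frame.loc[10, "Sales"]    # row labelled 10, column "Sales"
by_position = frame.iloc[0, 1]       # first row, second column

# loc slices include the end label; iloc slices exclude the end position
label_slice = frame.loc[10:20]
position_slice = frame.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```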

Accessing a specified element

In [90]:
#Using the index and the name of the column
dataFrameSales.at[2,"Sales"]

np.float64(398.0)

In [91]:
#Using the iat accessor
#Syntax: df.iat[row position, column position]
dataFrameSales.iat[0,3]


'Product1'
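
A compact sketch of both scalar accessors side by side, on a two-row frame with values taken from the sales data. The name `frame` is illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Product": ["Product1", "Product3"], "Sales": [754.0, 110.0]}
)

# at: row label and column name
print(frame.at[1, "Sales"])

# iat: row position first, then column position
print(frame.iat[1, 0])

# Both accessors can also assign a single cell
frame.at[0, "Sales"] = 800.0
print(frame)
```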

Data manipulation with DataFrame

In [92]:
#Adding a new column
dataFrame["salary"]=[50000,80000,250000]
dataFrame

      name  age      city  salary
0    Krish   25  Pretoria   50000
1     John   30  New York   80000
2  Guevara   45    Mumbai  250000


In [93]:
#Removing a column

dataFrame.drop("salary", axis=1, inplace=True) # axis=0 (the default) drops rows by label; axis=1 drops columns. inplace=True modifies dataFrame itself instead of returning a new copy
dataFrame

      name  age      city
0    Krish   25  Pretoria
1     John   30  New York
2  Guevara   45    Mumbai
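
As an alternative to `inplace=True`, `drop` can return a new DataFrame and leave the original untouched, which may be easier to follow when cells are re-run. A minimal sketch with illustrative names:

```python
import pandas as pd

frame = pd.DataFrame({
    "name": ["Krish", "John", "Guevara"],
    "age": [25, 30, 45],
    "salary": [50000, 80000, 250000],
})

# Without inplace=True, drop returns a new DataFrame
without_salary = frame.drop(columns="salary")

# Rows are dropped by index label
without_first = frame.drop(index=0)

print(list(frame.columns))            # original is unchanged
print(list(without_salary.columns))
print(list(without_first.index))
```

The `columns=` and `index=` keywords are equivalent to passing `axis=1` and `axis=0`.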


In [None]:
#Increase everybody's age by one year
dataFrame["age"]=dataFrame["age"]+1
dataFrame

      name  age      city
0    Krish   26  Pretoria
1     John   31  New York
2  Guevara   46    Mumbai


In [None]:
#Remove a specific row by its index label
dataFrame.drop(0,inplace=True)
dataFrame

      name  age      city
1     John   31  New York
2  Guevara   46    Mumbai


Attributes of the dataFrame

In [96]:
# Display the data types of each column
print(f"Data types: \n{dataFrameSales.dtypes}")
print("")

#Describe the dataFrame
print(f"Statistical summary: \n{dataFrameSales.describe()}")

print("")
#Group by a column and perform an aggregation
grouped= dataFrameSales.groupby("Product")["Value"].mean()
print(f"Mean value by product: \n{grouped}")

Data types: 
Date         object
Category     object
Value       float64
Product      object
Sales       float64
Region       object
dtype: object

Statistical summary: 
           Value       Sales
count  47.000000   46.000000
mean   51.744681  557.130435
std    29.050532  274.598584
min     2.000000  108.000000
25%    27.500000  339.000000
50%    54.000000  591.500000
75%    70.000000  767.500000
max    99.000000  992.000000

Mean value by product: 
Product
Product1    46.214286
Product2    52.800000
Product3    55.166667
Name: Value, dtype: float64
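
The `count` row of the summary (47 and 46) is lower than the 50 rows in the file, which suggests that some entries in `Value` and `Sales` are missing. The sketch below uses hypothetical data with similar gaps to show one way to count them and to request several aggregations at once.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, standing in for dataFrameSales
sales = pd.DataFrame({
    "Product": ["Product1", "Product1", "Product2", "Product2"],
    "Value": [10.0, np.nan, 30.0, 50.0],
    "Sales": [100.0, 200.0, np.nan, 400.0],
})

# Number of missing entries per column
missing = sales.isna().sum()
print(missing)

# Several aggregations at once; mean and count skip NaN by default
summary = sales.groupby("Product")["Value"].agg(["mean", "count"])
print(summary)
```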
