### Pandas (Data Analysis and Science Concepts)

Pandas is an open-source Python package designed for data manipulation: reshaping, indexing, merging, transformation, visualization, pivoting, grouping, splitting, and more.

It provides powerful, easy-to-use data structures:

* Series and DataFrame

* Series: In lay terms, a Series is like a single Excel column. More precisely, it is a one-dimensional labeled array that can hold any data type, such as integers, strings, booleans, or dates.

* DataFrame: In lay terms, a DataFrame is like an Excel spreadsheet, that is, a combination of rows and columns. More specifically, a Pandas DataFrame combines an Index (the row labels) with one Series per column.
    * Note: Index => Rows and Series => Columns

* SQL : Records (Rows) and Fields (Columns)
* DataFrame : Index (Rows) and Series (Columns)
* Excel: Rows and Columns
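
The Index/Series relationship described above can be checked directly. The following is a minimal sketch with made-up data (the column names `item` and `qty` and the row labels are hypothetical): selecting one column returns a Series that shares the DataFrame's row index, and selecting one row also returns a Series.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the structure.
df = pd.DataFrame(
    {"item": ["pen", "book", "bag"], "qty": [3, 1, 2]},
    index=["r1", "r2", "r3"],
)

column = df["qty"]                 # one column -> a Series
row = df.loc["r2"]                 # one row -> also a Series, labeled by column names

print(type(column).__name__)       # Series
print(column.index.equals(df.index))  # the column shares the DataFrame's row index
print(row["qty"])                  # value at row "r2", column "qty"
```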

### What Is Pandas Useful For?

* Data Manipulation
* Handling Missing Data Points
* Performing different types of operations
* Data Wrangling: Merge, Join, Reshape, Group, Pivot, etc.
* Very powerful for File Handling (CSV, Excel, JSON, SQL, etc.)
* Data Transformation and Filtering
* Data Visualization and Automation

### Where Can You Use Pandas?
* Data Cleaning and Transformation
* Data Analysis
* Exploratory Data Analysis (EDA)
* Financial and Statistical Analysis
* Data Engineering

In [1]:
import pandas as pd

### Pandas Data Structures:

* Series (Columns)
* DataFrame (Rows x Columns)

In [2]:
sr = pd.Series([10,11,45,20,50])

In [3]:
sr

0    10
1    11
2    45
3    20
4    50
dtype: int64

In [4]:
sr = pd.Series()

In [5]:
sr

Series([], dtype: object)

In [6]:
print(sr) # dtype is object: the generic dtype for arbitrary Python objects (strings are commonly stored this way)

Series([], dtype: object)
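
To make the object dtype more concrete, here is a small sketch (the variable names `mixed` and `numbers` are illustrative). A Series holding values of different Python types falls back to object, while a Series of plain integers gets an integer dtype.

```python
import pandas as pd

mixed = pd.Series([1, "two", 3.0])   # values of different Python types
numbers = pd.Series([1, 2, 3])       # homogeneous integers

print(mixed.dtype)     # object: each element is kept as an ordinary Python object
print(numbers.dtype)   # an integer dtype (int64 on most platforms)

# The original Python types survive inside an object Series.
print([type(v).__name__ for v in mixed])
```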


In [10]:
sr = pd.Series(data = None, dtype = 'int64')

In [11]:
sr

Series([], dtype: int64)

In [13]:
# print(sr.__doc__)

In [15]:
prices = [750000,390000,5570000,1780000,1091000]
carnames = ['Swift','Santro',"Audi",'Creta','Bolero']

In [16]:
car_series = pd.Series(data = prices )

In [17]:
car_series

0     750000
1     390000
2    5570000
3    1780000
4    1091000
dtype: int64

In [18]:
car_series_1 = pd.Series(data = prices, index = carnames)

In [19]:
car_series_1

Swift      750000
Santro     390000
Audi      5570000
Creta     1780000
Bolero    1091000
dtype: int64

In [20]:
type(car_series)

pandas.core.series.Series

In [21]:
type(car_series_1)

pandas.core.series.Series

In [22]:
pd.Series(prices, carnames, name = 'Price')

Swift      750000
Santro     390000
Audi      5570000
Creta     1780000
Bolero    1091000
Name: Price, dtype: int64

In [23]:
prices = [750000,390000,5570000,1780000,1091000]
carnames = ['Swift','Santro',"Audi",'Creta','Bolero']

In [26]:
car_prices = dict(zip(carnames,prices))

In [27]:
car_prices

{'Swift': 750000,
 'Santro': 390000,
 'Audi': 5570000,
 'Creta': 1780000,
 'Bolero': 1091000}

In [31]:
sr = pd.Series(data = car_prices, name = "PriceOfCar")

In [29]:
car_prices

{'Swift': 750000,
 'Santro': 390000,
 'Audi': 5570000,
 'Creta': 1780000,
 'Bolero': 1091000}

In [32]:
sr

Swift      750000
Santro     390000
Audi      5570000
Creta     1780000
Bolero    1091000
Name: PriceOfCar, dtype: int64

In [33]:
sr.name

'PriceOfCar'

In [34]:
sr.index

Index(['Swift', 'Santro', 'Audi', 'Creta', 'Bolero'], dtype='object')

In [35]:
sr.values

array([ 750000,  390000, 5570000, 1780000, 1091000], dtype=int64)

In [36]:
sr.shape

(5,)

In [37]:
sr.ndim

1

In [38]:
sr

Swift      750000
Santro     390000
Audi      5570000
Creta     1780000
Bolero    1091000
Name: PriceOfCar, dtype: int64

In [39]:
sr[sr < 1500000]

Swift      750000
Santro     390000
Bolero    1091000
Name: PriceOfCar, dtype: int64

In [43]:
(sr > 200000) & (sr < 800000)

Swift      True
Santro     True
Audi      False
Creta     False
Bolero    False
Name: PriceOfCar, dtype: bool

In [44]:
sr[(sr > 200000) & (sr < 800000)]

Swift     750000
Santro    390000
Name: PriceOfCar, dtype: int64
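
The filters above rely on boolean masks: a comparison on a Series yields a Series of True/False values, and indexing with that mask keeps only the True rows. The sketch below repeats the car prices so it runs on its own. Conditions are combined with `&`, `|` and `~` rather than `and`, `or` and `not`, and each comparison needs its own parentheses. `Series.between` is shown as one alternative for range tests.

```python
import pandas as pd

# Same values as car_prices above, repeated so the block is self-contained.
sr = pd.Series(
    {"Swift": 750000, "Santro": 390000, "Audi": 5570000,
     "Creta": 1780000, "Bolero": 1091000},
    name="PriceOfCar",
)

mask = (sr > 200000) & (sr < 800000)   # element-wise AND of two boolean Series
in_range = sr[mask]
print(in_range)

# between() expresses the same range test; inclusive="neither" excludes both bounds.
same = sr[sr.between(200000, 800000, inclusive="neither")]
print(same.equals(in_range))

# ~ negates a mask: everything outside the range.
outside = sr[~mask]
print(outside.index.tolist())
```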

In [45]:
car_prices

{'Swift': 750000,
 'Santro': 390000,
 'Audi': 5570000,
 'Creta': 1780000,
 'Bolero': 1091000}

In [46]:
type(car_prices)

dict

### Creating a DataFrame

In [47]:
df = pd.DataFrame()

In [48]:
print(df)

Empty DataFrame
Columns: []
Index: []


In [50]:
df.shape # shape is an attribute that returns a tuple: (number of rows, number of columns)

(0, 0)

In [51]:
df = pd.DataFrame(car_prices)

ValueError: If using all scalar values, you must pass an index

In [52]:
car_prices

{'Swift': 750000,
 'Santro': 390000,
 'Audi': 5570000,
 'Creta': 1780000,
 'Bolero': 1091000}

In [55]:
# Fix for the error above: pass an explicit index label for the single row of scalar values
df = pd.DataFrame(car_prices, index = ["A"])

In [56]:
df

|   | Swift | Santro | Audi | Creta | Bolero |
|---|---|---|---|---|---|
| A | 750000 | 390000 | 5570000 | 1780000 | 1091000 |


In [57]:
car_prices

{'Swift': 750000,
 'Santro': 390000,
 'Audi': 5570000,
 'Creta': 1780000,
 'Bolero': 1091000}

In [59]:
car_prices.values()

dict_values([750000, 390000, 5570000, 1780000, 1091000])

In [68]:
# Wrap each price in a one-element list so that DataFrame can infer a single row
car_dict = dict(map(lambda x : (x[0],[x[1]]), zip(carnames,prices)))

In [67]:
for key, value in car_prices.items():
    print({key : [value]})

{'Swift': [750000]}
{'Santro': [390000]}
{'Audi': [5570000]}
{'Creta': [1780000]}
{'Bolero': [1091000]}


In [69]:
pd.DataFrame(car_dict)

|   | Swift | Santro | Audi | Creta | Bolero |
|---|---|---|---|---|---|
| 0 | 750000 | 390000 | 5570000 | 1780000 | 1091000 |
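
Besides passing an index or wrapping each value in a list, two other routes handle a dictionary of scalars. This is a sketch of alternatives, not the only way: a list containing the dictionary is read as one record (one row), and going through a Series puts the car names in the index instead of the columns.

```python
import pandas as pd

car_prices = {"Swift": 750000, "Santro": 390000, "Audi": 5570000,
              "Creta": 1780000, "Bolero": 1091000}

# A list holding one dict is treated as one row; the row label defaults to 0.
wide = pd.DataFrame([car_prices])
print(wide.shape)

# A Series converted with to_frame() gives one row per car and a single "Price" column.
long = pd.Series(car_prices, name="Price").to_frame()
print(long.shape)
print(long.loc["Audi", "Price"])
```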


In [70]:
employee = {"Name" : "Aman"}

In [73]:
pd.DataFrame(employee, index = ["A"])

|   | Name |
|---|---|
| A | Aman |


In [74]:
employee = {"Name" : ["Aman"]}
pd.DataFrame(employee)

|   | Name |
|---|---|
| 0 | Aman |


In [77]:
EmployeeDb = {"Name" : ["Ankit","Aman","Abhishek","Manya","Sugandha"],
             "Salary" : [10000,11000,9000,3000,40000]}

In [78]:
pd.DataFrame(EmployeeDb)

|   | Name | Salary |
|---|---|---|
| 0 | Ankit | 10000 |
| 1 | Aman | 11000 |
| 2 | Abhishek | 9000 |
| 3 | Manya | 3000 |
| 4 | Sugandha | 40000 |
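
A dictionary of lists works here because every list has the same length. As a hedged illustration (the shortened dictionary `bad` is hypothetical), lists of unequal length raise a ValueError, while wrapping each list in a Series lets Pandas align on the index and fill the gap with NaN.

```python
import pandas as pd

# Hypothetical dictionary whose columns have different lengths.
bad = {"Name": ["Ankit", "Aman", "Abhishek"], "Salary": [10000, 11000]}

raised = False
try:
    pd.DataFrame(bad)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Wrapping each list in a Series aligns on the index; the missing salary becomes NaN.
padded = pd.DataFrame({key: pd.Series(values) for key, values in bad.items()})
print(padded)
```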


In [79]:
import pandas as pd

In [None]:
# !pip uninstall pandas
# !pip install pandas
# !pip install pandas==2.2.3  # pin a specific release

In [None]:
# Reading data from different sources

### OS Module:

https://youtu.be/Tp4qTuHROX4?si=fhivb5hODcoHkXat

In [80]:
import os

In [81]:
os.getcwd()

'C:\\Users\\AEPAC\\Desktop\\KnowledgeHut\\2024\\2024_KnowledgeHut\\Data Analyst (MID - 14th Sep)\\Python\\13th Oct - Python Lambda and Modules - Pandas'

In [82]:
os.listdir()

['.ipynb_checkpoints',
 '2015.csv',
 'breast_cancer.txt',
 'EmployeeDB.csv',
 'gnp - Copy.sas7bdat',
 'iris.json',
 'Lambda function.ipynb',
 'Python Pandas - Data Analysis Journey.ipynb',
 'Superstore.xls']
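
The os functions above also combine well when locating files before reading them. The sketch below uses a temporary directory and a hypothetical file name (`example.csv`) so that it does not depend on any real folder: it builds a path with `os.path.join`, filters the `os.listdir` result by extension, and checks that the file exists.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")   # hypothetical file name
    with open(path, "w") as fh:
        fh.write("a,b\n1,2\n")

    # Keep only the CSV files found in the folder.
    csv_files = [name for name in os.listdir(folder) if name.endswith(".csv")]
    found = os.path.exists(path)

print(csv_files)
print(found)
```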

In [83]:
df_csv = pd.read_csv("EmployeeDB.csv")

In [84]:
# Display the first 5 records
df_csv.head()

|   | EmpID | FirstName | LastName | Education | Occupation | Grade | YearlyIncome | Sales | HireDate | DeptID |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1870 | Annie | Jenkins | Bachelors | Management | A | 35000 | 4650 | 1899-12-30 00:00:00.000 | 9 |
| 1 | 8843 | Benjamin | Willis | Master Degree | Professional | B | 50000 | 4093 | 1899-12-30 00:00:00.000 | 8 |
| 2 | 3727 | Christopher | Oliver | High School | Management | B | 50000 | 5555 | 1899-12-30 00:00:00.000 | 4 |
| 3 | 9641 | Kimberly | Coleman | Intermediate | Professional | D | 50000 | 3248 | 1899-12-30 00:00:00.000 | 7 |
| 4 | 1171 | Judy | Sanchez | Master Degree | Professional | D | 70000 | 3014 | 1899-12-30 00:00:00.000 | 4 |


In [85]:
df_csv.shape

(17, 10)
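
read_csv accepts options that shape the result while loading. The sketch below reads from an in-memory string instead of a file; the two rows are made up and only the column names mimic EmployeeDB.csv. `parse_dates` converts a text column to datetimes and `index_col` uses a column as the row labels.

```python
import io
import pandas as pd

# Hypothetical sample; only the column names follow EmployeeDB.csv.
csv_text = """EmpID,FirstName,YearlyIncome,HireDate
1,A,35000,2006-01-28
2,B,50000,2010-10-06
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["HireDate"],   # convert this column from text to datetime
    index_col="EmpID",          # use EmpID as the row labels
)

print(df.dtypes)
print(df.loc[2, "HireDate"].year)
```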

In [86]:
excel = pd.read_excel("Superstore.xls")

In [87]:
excel.shape

(9994, 21)

In [88]:
excel.head(1)

|   | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 |


In [89]:
txt = pd.read_csv("breast_cancer.txt")

In [90]:
txt.head(2)

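|   | id | clump_thickness | unif_cell_size | uniform_cell_shape | marginal_adhesion | single_epi_cell_size | bare_nuclei | bland_chromation | normal_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |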


In [91]:
txt.shape

(699, 11)
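
read_csv works on this .txt file because its values are comma-separated. For text files that use another delimiter or have no header row, the `sep`, `header` and `names` parameters can be set explicitly. The sketch below assumes a tab-separated layout purely for illustration; the three values per row are taken from the id, clump_thickness and class columns shown above.

```python
import io
import pandas as pd

# Assumed tab-separated text with no header row (for illustration only).
text = "1000025\t5\t2\n1002945\t5\t2\n"

df = pd.read_csv(
    io.StringIO(text),
    sep="\t",                                   # delimiter between fields
    header=None,                                # the text has no header line
    names=["id", "clump_thickness", "class"],   # supply column names ourselves
)

print(df)
print(df.shape)
```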

### JSON
JavaScript Object Notation (JSON): JSON data is structured much like a Python dictionary.

In [92]:
json = pd.read_json("iris.json")

In [93]:
json.head(1)

|   | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |


In [94]:
json.shape

(150, 5)
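
The dictionary analogy can be made explicit. In this sketch the JSON text is an in-memory sample (the first record matches the row shown above; the second is illustrative), and it assumes the file is laid out as a list of records, which may not be how iris.json is actually stored. Python's json module turns the text into a list of dictionaries, and both `pd.DataFrame` and `pd.read_json` produce the same table from it.

```python
import io
import json
import pandas as pd

# Sample JSON in an assumed "list of records" layout.
json_text = """[
  {"sepalLength": 5.1, "sepalWidth": 3.5, "petalLength": 1.4, "petalWidth": 0.2, "species": "setosa"},
  {"sepalLength": 4.9, "sepalWidth": 3.0, "petalLength": 1.4, "petalWidth": 0.2, "species": "setosa"}
]"""

records = json.loads(json_text)        # a list of Python dictionaries
print(type(records[0]).__name__)       # dict

from_dicts = pd.DataFrame(records)     # build the table from the dictionaries
iris = pd.read_json(io.StringIO(json_text), orient="records")  # or parse the text directly

print(iris.shape)
print(list(iris.columns) == list(from_dicts.columns))
```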

### SAS: Statistical Analysis System

In [95]:
sas = pd.read_sas("gnp - Copy.sas7bdat")

In [96]:
sas.head(2)

|   | DATE | GNP | CONSUMP | INVEST | EXPORTS | GOVT |
|---|---|---|---|---|---|---|
| 0 | 1960-01-01 | 516.1 | 325.5 | 88.7 | 4.3 | 97.6 |
| 1 | 1960-04-01 | 514.5 | 331.6 | 78.1 | 5.1 | 99.6 |


In [97]:
sas.shape

(126, 6)

In [100]:
clipboard = pd.read_clipboard()

In [101]:
clipboard

|   | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name |
|---|---|---|---|---|---|
| 0 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute |
| 1 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute |
| 2 | 6/12/2016 | 6/16/2016 | Second Class | DV-13045 | Darrin Van Huff |


In [104]:
pd.read_html(r"https://en.wikipedia.org/wiki/Ratan_Tata")[0]

|   | Ratan TataGBE AO | Ratan TataGBE AO.1 |
|---|---|---|
| 0 | Tata in 2005 | Tata in 2005 |
| 1 | Born | Ratan Naval Tata 28 December 1937 Bombay, Bomb... |
| 2 | Died | 9 October 2024 (aged 86) Mumbai, Maharashtra, ... |
| 3 | Alma mater | Cornell University (BArch) |
| 4 | Occupations | IndustrialistPhilanthropist |
| 5 | Title | Chairman Emeritus, Tata Sons and Tata Group[1] |
| 6 | Term | 1991–20122016–2017 |
| 7 | Predecessor | J. R. D. Tata |
| 8 | Successor | Cyrus Mistry (2012–2016)Natarajan Chandrasekar... |
| 9 | Parents | Naval Tata (father)Sooni Commissariat (mother) |


A DataFrame can also be written back out with the matching writer methods:

* df.to_json
* df.to_csv
* df.to_sql
* df.to_excel, etc.
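
As a brief sketch of the writer methods listed above: when no file path is given, `to_csv` and `to_json` return the serialized text, which is convenient for inspecting the output. The small DataFrame reuses two rows of the employee example. `to_excel` and `to_sql` are left out because they need an extra Excel engine or a database connection.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Ankit", "Aman"], "Salary": [10000, 11000]})

csv_text = df.to_csv(index=False)          # returns a string when no path is given
json_text = df.to_json(orient="records")   # one JSON object per row

print(csv_text)
print(json_text)

# Reading the CSV text back should reproduce the original DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
```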

* DataFrame Operations
* Pivot Tables
* Split, Apply, Combine, Join, Merge
* Pandas Statistics
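
As a rough preview of the topics listed above, the sketch below uses made-up data (the `sales` and `depts` tables and all their values are hypothetical). It shows one call each for split-apply-combine, a pivot table, a merge and descriptive statistics.

```python
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.DataFrame({
    "dept": ["A", "A", "B", "B"],
    "year": [2023, 2024, 2023, 2024],
    "amount": [10, 20, 30, 40],
})
depts = pd.DataFrame({"dept": ["A", "B"], "manager": ["Asha", "Ravi"]})

# Split-apply-combine: split by dept, apply sum, combine into one Series.
totals = sales.groupby("dept")["amount"].sum()

# Pivot table: departments as rows, years as columns.
pivot = sales.pivot_table(index="dept", columns="year", values="amount", aggfunc="sum")

# Merge: SQL-style left join on the shared "dept" column.
merged = sales.merge(depts, on="dept", how="left")

# Descriptive statistics for one numeric column.
stats = sales["amount"].describe()

print(totals)
print(pivot)
print(merged)
print(stats["mean"])
```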