In [1]:
import pandas as pd
import numpy as np

<h3>Pandas uses a labeled (named) index, unlike NumPy's purely numeric index</h3>

<h2>Pandas Series</h2>

Calling pd.Series with only the data produces a pandas Series with a default numeric index (0, 1, 2, ...).
To label the index explicitly, pass index=(index labels).

In [2]:
myindex = ['USA','Canada','Mexico']
mydata = [1776,1867,1821]
myser = pd.Series(data=mydata,index=myindex) #it's a pandas series
myser

USA       1776
Canada    1867
Mexico    1821
dtype: int64
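As a minimal sketch of the difference described above, the same data can be built once without and once with index labels. The variable names `default_ser` and `labeled_ser` are made up for this illustration.

```python
import pandas as pd

mydata = [1776, 1867, 1821]

# Only data given: pandas assigns a default numeric index 0, 1, 2
default_ser = pd.Series(data=mydata)
print(default_ser.index.tolist())

# Same data with explicit index labels
labeled_ser = pd.Series(data=mydata, index=['USA', 'Canada', 'Mexico'])
print(labeled_ser.index.tolist())
```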

We can now grab values by label, almost as if the Series were a dictionary.

In [3]:
myser.iloc[0] #positional lookup; searching numerically can be more confusing, which is why labels are better
myser['USA'] #both return the same value


1776
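Label-based and position-based lookups can be made explicit with `.loc` and `.iloc`. This is a small sketch, not the only way to index: recent pandas releases warn when a plain integer such as `myser[0]` is used as a position on a Series with string labels, so spelling out the intent tends to be safer.

```python
import pandas as pd

myser = pd.Series(data=[1776, 1867, 1821], index=['USA', 'Canada', 'Mexico'])

by_label = myser.loc['USA']     # look up by index label
by_position = myser.iloc[0]     # look up by integer position
print(by_label, by_position)
```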

We can also pass a dictionary to supply both index and values at once: dic = {index1:value1,index2:value2}

In [4]:
ages = {'Sam':5,'Frank':10,'Spike':7}
pd.Series(ages) #Makes the keys as the index, and the values as the data

Sam       5
Frank    10
Spike     7
dtype: int64

In [5]:
q1 = {'Japan':80,'China':450,'India':200,'USA':250}
q2 = {'Brazil':100,'China':500,'India':100,'USA':260}
#imaginary data

In [6]:
sales_q1 = pd.Series(q1)
sales_q2 = pd.Series(q2)

In [7]:
sales_q1.keys() #returns the index labels (alias for sales_q1.index)

Index(['Japan', 'China', 'India', 'USA'], dtype='object')

Since pandas is built on NumPy, arithmetic is applied element-wise (broadcast), e.g. sales_q1 * 2.
On a normal Python list, [1,2] * 2 repeats the list and gives [1,2,1,2],
but on a NumPy array or pandas Series it multiplies each element and gives [2,4].
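The contrast between list repetition and element-wise multiplication can be checked directly. This sketch reuses the `sales_q1` values from the cells above; `py_list`, `np_arr` and `doubled` are illustrative names.

```python
import numpy as np
import pandas as pd

py_list = [1, 2] * 2              # list repetition
np_arr = np.array([1, 2]) * 2     # element-wise multiplication

sales_q1 = pd.Series({'Japan': 80, 'China': 450, 'India': 200, 'USA': 250})
doubled = sales_q1 * 2            # element-wise, index labels are kept

print(py_list)
print(np_arr)
print(doubled)
```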

In [8]:
sales_q1 + sales_q2 #if a key is not in both pandas Series, its result is NaN

Brazil      NaN
China     950.0
India     300.0
Japan       NaN
USA       510.0
dtype: float64

<h3>Fill Value</h3>

Plain addition gave NaN for Japan and Brazil because each appears in only one Series:
Japan is in q1 but not in q2, so pandas computed Japan's value + NaN = NaN.

Setting fill_value=0 substitutes 0 for the missing side, so pandas computes Japan's value + 0 = Japan's value.

That is why every label now has a numeric result.

In [9]:
sales_q1.add(sales_q2,fill_value=0)

Brazil    100.0
China     950.0
India     300.0
Japan      80.0
USA       510.0
dtype: float64
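The two results above are float64 because NaN is a floating-point value. As a hedged sketch, once no NaN remains, the filled result can be cast back to integers with `astype`. The names `plain`, `filled` and `totals` are made up for this example.

```python
import pandas as pd

sales_q1 = pd.Series({'Japan': 80, 'China': 450, 'India': 200, 'USA': 250})
sales_q2 = pd.Series({'Brazil': 100, 'China': 500, 'India': 100, 'USA': 260})

plain = sales_q1 + sales_q2                     # labels missing on one side become NaN
filled = sales_q1.add(sales_q2, fill_value=0)   # the missing side is treated as 0

# No NaN is left in `filled`, so an integer cast is safe here
totals = filled.astype(int)
print(totals)
```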

<h2>Pandas DataFrame</h2>

A DataFrame is a table of rows and columns in pandas that is easy to restructure and filter.

It can be thought of as a group of pandas Series objects that share the same index.
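To make the "group of Series" idea concrete, here is a minimal sketch that builds a DataFrame from three Series sharing one index. The values and the name `df_from_series` are illustrative only.

```python
import pandas as pd

index = ['CA', 'NY', 'AZ', 'TX']
jan = pd.Series([95, 70, 75, 40], index=index)
feb = pd.Series([11, 63, 9, 4], index=index)
mar = pd.Series([81, 87, 77, 63], index=index)

# Each dictionary key becomes a column name; the shared index labels the rows
df_from_series = pd.DataFrame({'Jan': jan, 'Feb': feb, 'Mar': mar})
print(df_from_series)

# Pulling one column back out gives a Series again
print(type(df_from_series['Jan']))
```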

In [10]:
np.random.seed(101)
mydata = np.random.randint(0,101,(4,3))
mydata

array([[95, 11, 81],
       [70, 63, 87],
       [75,  9, 77],
       [40,  4, 63]])

In [11]:
myindex = ['CA','NY','AZ','TX']
mycolumns = ['Jan','Feb','Mar']

df = pd.DataFrame(index=myindex,columns=mycolumns,data=mydata)
df

    Jan  Feb  Mar
CA   95   11   81
NY   70   63   87
AZ   75    9   77
TX   40    4   63


In [12]:
df.info() #prints a concise summary of the DataFrame

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, CA to TX
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Jan     4 non-null      int32
 1   Feb     4 non-null      int32
 2   Mar     4 non-null      int32
dtypes: int32(3)
memory usage: 80.0+ bytes


<h3>How to Read Data from an Existing Data File into Pandas</h3>

In [13]:
pwd #shows the current working directory, i.e. where the notebook is running

'd:\\VSCode\\MLANDDS\\MLandDS'

In [14]:
ls

 Volume in drive D is HardDrive1
 Volume Serial Number is 0A43-0423

 Directory of d:\VSCode\MLANDDS\MLandDS

10/05/2023  10:42 PM    <DIR>          .
10/05/2023  10:42 PM    <DIR>          ..
10/05/2023  10:42 PM    <DIR>          DATA
10/05/2023  10:42 PM             8,292 numpyBasics.ipynb
10/05/2023  10:42 PM            21,468 pandasBasics.ipynb
10/05/2023  10:42 PM                11 README.md
               3 File(s)         29,771 bytes
               3 Dir(s)  553,824,104,448 bytes free


In [17]:
df = pd.read_csv('DATA/tips.csv')
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
243,18.78,3.00,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672


In [19]:
df.columns #to grab columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
      dtype='object')

In [21]:
df.index #values of index

RangeIndex(start=0, stop=244, step=1)
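The load-then-inspect steps can be tried without a file on disk. This sketch assumes a tiny three-row CSV held in a string (`csv_text` is a hypothetical stand-in for DATA/tips.csv) and relies on `pd.read_csv` accepting a file-like object as well as a path.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk
csv_text = """total_bill,tip,day
16.99,1.01,Sun
10.34,1.66,Sun
21.01,3.50,Sun
"""

small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.columns.tolist())   # column names come from the header row
print(small_df.index)              # default numeric index when none is specified
```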

In [22]:
df.head() #returns the first 5 rows unless another number is given, e.g. df.head(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


In [23]:
df.tail() #returns the last 5 rows unless another number is given

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
243,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672
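Both methods take an optional row count. The short sketch below uses a made-up seven-row frame (`demo`) to show a non-default count for each.

```python
import pandas as pd

# Illustrative values only
demo = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77]})

first_three = demo.head(3)   # rows at positions 0, 1, 2
last_two = demo.tail(2)      # rows at positions 5, 6

print(first_three)
print(last_two)
```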


In [24]:
df.info() #gives information on the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


In [25]:
df.describe() #gives statistical information

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


Transpose swaps rows and columns: the index becomes the column labels and vice versa.

In [26]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_bill,244.0,19.78594,8.902412,3.07,13.3475,17.795,24.1275,50.81
tip,244.0,2.998279,1.383638,1.0,2.0,2.9,3.5625,10.0
size,244.0,2.569672,0.9510998,1.0,2.0,2.0,3.0,6.0
price_per_person,244.0,7.888197,2.914234,2.88,5.8,7.255,9.39,20.27
CC Number,244.0,2563496000000000.0,2369340000000000.0,60406790000.0,30407310000000.0,3525318000000000.0,4553675000000000.0,6596454000000000.0
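On a small frame the swap is easy to see. In this sketch, `small` and `flipped` are illustrative names, and `.T` is the shorthand attribute for `transpose()`.

```python
import pandas as pd

small = pd.DataFrame(
    {'Jan': [95, 70], 'Feb': [11, 63], 'Mar': [81, 87]},
    index=['CA', 'NY'],
)

flipped = small.T   # same result as small.transpose()

print(small.shape, flipped.shape)
print(flipped)
```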


<h3>How to retrieve data from a DataFrame: selecting, creating, and adjusting columns</h3>

<h4>Information Based on Columns</h4>

In [27]:
df['total_bill'] #returns the total_bill column

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

In [29]:
type(df['total_bill']) #a single column comes back as a pandas Series

pandas.core.series.Series

In [35]:
mycols = ['total_bill','tip']
df[mycols]
#equivalent to df[['total_bill','tip']], which also returns a DataFrame

Unnamed: 0,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.50
3,23.68,3.31
4,24.59,3.61
...,...,...
239,29.03,5.92
240,27.18,2.00
241,22.67,2.00
242,17.82,1.75
243,18.78,3.00


In [36]:
type(df[mycols]) #returns a DataFrame, because it is now more than one series


pandas.core.frame.DataFrame

If you want a DataFrame back even though you are asking for only one column,
put the column name inside a list: df[[name]]

In [37]:
type(df[['total_bill']])

pandas.core.frame.DataFrame
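The difference between single and double brackets also shows up in the shape of the result. This is a small sketch on a made-up two-row frame (`df_demo`).

```python
import pandas as pd

df_demo = pd.DataFrame({'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]})

as_series = df_demo['total_bill']     # single brackets: one-dimensional Series
as_frame = df_demo[['total_bill']]    # list of names: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```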

In [39]:
df['tip'] + df['total_bill'] #element-wise addition, same idea as NumPy

0      18.00
1      12.00
2      24.51
3      26.99
4      28.20
       ...  
239    34.95
240    29.18
241    24.67
242    19.57
243    21.78
Length: 244, dtype: float64

We can create new columns (features) from the results of such operations.

In [41]:
df['tip_percentage'] = 100*df['tip']/df['total_bill']
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.054159
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.680765


Use NumPy functions such as np.round to make the data clearer; for example, a percentage does not need so many digits.

In [43]:
df['tip_percentage'] = np.round(100*df['tip']/df['total_bill'],2) #round to 2 decimal places
#assigning to a column that already exists just overwrites it
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.94
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.05
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,16.66
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.98
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.68
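A pandas Series also has its own `round` method, which can be used instead of `np.round`. The sketch below uses a made-up two-row frame (`bills`) and checks that both spellings agree.

```python
import numpy as np
import pandas as pd

bills = pd.DataFrame({'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]})
pct = 100 * bills['tip'] / bills['total_bill']

bills['tip_percentage'] = pct.round(2)   # Series method
via_numpy = np.round(pct, 2)             # NumPy function

print(bills)
print(via_numpy.equals(bills['tip_percentage']))
```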


<h3>To Drop Columns</h3>

The default axis for drop() is 0, so to drop a column we need to set axis=1.

0 - rows
1 - columns

So to drop a row, pass axis=0 or leave the default; to drop a column,
pass axis=1.

<h3>To Permanently Drop</h3>
By default drop() returns a new DataFrame and leaves the original untouched, so the dropped columns are still there the next time the original is used. To change the original itself, set inplace=True.

Another way is to assign the result of drop() back to the same variable with "=".

In [44]:
df = df.drop('tip_percentage',axis=1) #same effect as df.drop('tip_percentage',axis=1,inplace=True)
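The drop behaviours described above can be compared side by side. This is a sketch on a made-up two-row frame (`demo`, with hypothetical row labels 'a' and 'b'). The `columns=` keyword is an alternative spelling of axis=1 that drop() also accepts.

```python
import pandas as pd

demo = pd.DataFrame(
    {'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66], 'tip_percentage': [5.94, 16.05]},
    index=['a', 'b'],
)

# Without inplace or reassignment, the original keeps the column
without_col = demo.drop('tip_percentage', axis=1)
still_there = 'tip_percentage' in demo.columns
print(still_there)

# columns= is an alternative to axis=1
also_without = demo.drop(columns='tip_percentage')

# axis=0 (the default) drops a row by its index label
without_row = demo.drop('a')

# inplace=True modifies demo itself and returns None
result = demo.drop('tip_percentage', axis=1, inplace=True)
print(result, list(demo.columns))
```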

<h4>Information Based on Index</h4>