# Pandas

### What can we do with Pandas?
- Read and write data in many formats.
- Select data intelligently using indexing, logic, subsetting, and more.
- Handle missing data.
- Adjust and restructure data.
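
As a rough illustration of the four capabilities listed above, the sketch below runs them on a tiny in-memory CSV. The data and the names `csv_text`, `boston`, `filled`, and `wide` are made up for this example and are not part of the files used later in the notebook.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """city,month,sales
Austin,Jan,100
Austin,Feb,
Boston,Jan,80
Boston,Feb,95
"""

# Reading: read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Selecting: a boolean condition picks out the matching rows
boston = df[df['city'] == 'Boston']

# Missing data: the empty Feb value for Austin is read as NaN, then filled with 0
filled = df.fillna({'sales': 0})

# Restructuring: pivot the long table into one row per city
wide = filled.pivot(index='city', columns='month', values='sales')

# Writing: to_csv with no path returns the CSV text as a string
csv_out = wide.to_csv()
print(wide)
```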

### Series

- A Series is a pandas data structure that holds an array of values along with a named index.
- The named index differentiates a Series from a plain NumPy array.
- Formal definition: a one-dimensional ndarray with axis labels.

In [1]:
import numpy as np
import pandas as pd

In [2]:
myindexx = ['USA', 'Canada', 'Mexico']
mydataa = [1776, 1867, 1821]

In [3]:
myser = pd.Series(data=mydataa, index=myindexx)

In [5]:
myser

USA       1776
Canada    1867
Mexico    1821
dtype: int64

In [6]:
myser['USA']

np.int64(1776)
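
Square-bracket lookup by label works as shown above. The notebook does not use them, but pandas also provides the explicit accessors `.loc` (by label) and `.iloc` (by integer position), which make the intent clearer. A minimal sketch that rebuilds the same Series:

```python
import pandas as pd

myser = pd.Series(data=[1776, 1867, 1821], index=['USA', 'Canada', 'Mexico'])

by_label = myser.loc['Canada']         # look up by index label
by_position = myser.iloc[1]            # look up by integer position (0-based)
subset = myser.loc[['USA', 'Mexico']]  # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset)
```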

In [7]:
ages = {'Sam': 25, 'Frank': 30, 'Dan': 22, 'John': 35}

In [8]:
pd.Series(ages)

Sam      25
Frank    30
Dan      22
John     35
dtype: int64

In [10]:
# Imaginary sales data for 1st and 2nd quarters for global companies
q1 = {'Japan': 80, 'China': 450, 'India': 200, 'USA': 250}
q2 = {'Brazil': 100, 'China': 500, 'India': 210, 'USA': 260}

In [11]:
sales_q1 = pd.Series(q1)
sales_q2 = pd.Series(q2)

In [12]:
sales_q1

Japan     80
China    450
India    200
USA      250
dtype: int64

In [13]:
sales_q2

Brazil    100
China     500
India     210
USA       260
dtype: int64

In [14]:
sales_q2.keys()

Index(['Brazil', 'China', 'India', 'USA'], dtype='object')

In [15]:
sales_q1 + sales_q2

Brazil      NaN
China     950.0
India     410.0
Japan       NaN
USA       510.0
dtype: float64
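
The `NaN` entries appear because pandas aligns the two Series on their index labels before adding. A label that exists in only one Series has no partner value, so the result is missing, and the dtype becomes `float64` because `NaN` is a floating-point value. The sketch below, which rebuilds the same two Series, checks that the unmatched labels are exactly the ones that come out as `NaN`; the names `only_one_side` and `missing` are introduced only for this example.

```python
import pandas as pd

sales_q1 = pd.Series({'Japan': 80, 'China': 450, 'India': 200, 'USA': 250})
sales_q2 = pd.Series({'Brazil': 100, 'China': 500, 'India': 210, 'USA': 260})

total = sales_q1 + sales_q2

# Labels present in only one of the two Series
only_one_side = sales_q1.index.symmetric_difference(sales_q2.index)

# Labels whose sum is NaN
missing = total[total.isna()].index

print(list(only_one_side))
print(list(missing))
```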

In [16]:
sales_q1.add(sales_q2, fill_value=0)

Brazil    100.0
China     950.0
India     410.0
Japan      80.0
USA       510.0
dtype: float64
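
With `fill_value=0`, a label missing from one Series is treated as 0 for that Series, so no `NaN` remains, but the result is still `float64`. If whole numbers are preferred, one option is to cast back with `astype`. This is a sketch rebuilding the same data; `total_int` is a name chosen for this example.

```python
import pandas as pd

sales_q1 = pd.Series({'Japan': 80, 'China': 450, 'India': 200, 'USA': 250})
sales_q2 = pd.Series({'Brazil': 100, 'China': 500, 'India': 210, 'USA': 260})

total = sales_q1.add(sales_q2, fill_value=0)

# Safe here because fill_value=0 left no NaN values behind
total_int = total.astype('int64')
print(total_int)
```

The cast would raise an error if any `NaN` were still present, which is why it is applied only after the fill.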

### DataFrames

- A DataFrame is a table of rows and columns in pandas that we can easily restructure and filter.
- Formal definition: a group of pandas Series objects that share the same index.
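
The formal definition can be made concrete by building a DataFrame from Series that share an index and then pulling one column back out. This is a small sketch with illustrative values; the names `jan`, `feb`, and `df_from_series` are introduced only for this example.

```python
import pandas as pd

index = ['CA', 'NY', 'AZ', 'TX']

# Two Series that share the same index (illustrative values)
jan = pd.Series([95, 70, 75, 40], index=index)
feb = pd.Series([11, 63, 9, 4], index=index)

# Each dict key becomes a column name
df_from_series = pd.DataFrame({'Jan': jan, 'Feb': feb})

# Selecting a single column returns a Series with the shared index
jan_col = df_from_series['Jan']
print(type(jan_col))
print(jan_col.index.equals(df_from_series.index))
```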

In [17]:
np.random.seed(101)
mydata = np.random.randint(0, 101, (4, 3))
myindex = ['CA', 'NY', 'AZ', 'TX']
mycols = ['Jan', 'Feb', 'Mar']

In [18]:
df = pd.DataFrame(data=mydata, index=myindex, columns=mycols)
df

    Jan  Feb  Mar
CA   95   11   81
NY   70   63   87
AZ   75    9   77
TX   40    4   63


In [19]:
pwd

'/Users/anurag_warthi/Downloads/Data-Science-Machine-Learning/Data Science/01-DS_Libraries'

In [20]:
# reading files using pandas
df = pd.read_csv('Files/tips.csv')

In [21]:
df.head()

   total_bill   tip     sex  smoker  day    time  size  price_per_person          Payer Name         CC Number  Payment ID
0       16.99  1.01  Female      No  Sun  Dinner     2              8.49  Christy Cunningham  3560325168603410     Sun2959
1       10.34  1.66    Male      No  Sun  Dinner     3              3.45      Douglas Tucker  4478071379779230     Sun4608
2       21.01  3.50    Male      No  Sun  Dinner     3              7.00      Travis Walters  6011812112971322     Sun4458
3       23.68  3.31    Male      No  Sun  Dinner     2             11.84    Nathaniel Harris  4676137647685994     Sun5260
4       24.59  3.61  Female      No  Sun  Dinner     4              6.15        Tonya Carter  4832732618637221     Sun2251


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB
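
Two details in this summary are worth a closer look. `CC Number` was inferred as `int64`, although card numbers are identifiers rather than quantities, and the `+` after the memory figure indicates that the contents of the `object` columns were not fully counted. The sketch below shows one way to handle both. It assumes a two-row inline sample copied from the `head()` output instead of the real `Files/tips.csv`, and the names `sample_csv` and `tips` are hypothetical.

```python
import io
import pandas as pd

# Two rows copied from the head() output, standing in for Files/tips.csv
sample_csv = """total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
"""

# dtype= overrides type inference for the named column only
tips = pd.read_csv(io.StringIO(sample_csv), dtype={'CC Number': str})

# deep=True also counts the memory held by string values
shallow_bytes = tips.memory_usage().sum()
deep_bytes = tips.memory_usage(deep=True).sum()

print(tips['CC Number'].iloc[0])
print(shallow_bytes, deep_bytes)
```

Keeping the column as text prevents accidental arithmetic on card numbers and would preserve leading zeros if any were present.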
