## Pandas

### We use pandas because it makes working with structured data (like tables, spreadsheets, and time series) in Python fast, easy, and powerful.

In [107]:
import pandas as pd
import numpy as np

#### The two most important data structures in pandas are Series and DataFrame.
#### Because they are used so frequently, you may find it easier to import them into the local namespace:
#### from pandas import Series, DataFrame

## Series 

#### A Series is a one-dimensional array-like object containing a sequence of values of the same type (similar to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.


In [108]:
obj=pd.Series([1,2,3,4])
obj

0    1
1    2
2    3
3    4
dtype: int64

In [109]:
obj.array # the underlying values, wrapped in a pandas extension array
obj.index # the index (a default RangeIndex here)

RangeIndex(start=0, stop=4, step=1)

In [110]:
# create a Series with an index identifying each data point with a label

obj1=pd.Series([4,7,-5,3],index=['a','b','c','d'])
obj1.index

# obj1['d'] # returns the value stored under the label 'd'
obj1[['a','c','d']] # a list of labels selects several values at once; the labels here are strings rather than integers

a    4
c   -5
d    3
dtype: int64

In [111]:
obj1[obj1>0] # filtering with a Boolean mask

# arithmetic such as multiplication or addition is applied element-wise

obj1*2 

# operations between two Series align on index labels; a label present in only one Series gives NaN

obj2=pd.Series([12,9,4,0],index=['a','b','c','e'])

obj1+obj2

a    16.0
b    16.0
c    -1.0
d     NaN
e     NaN
dtype: float64
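
The NaN values above come from index alignment: `d` exists only in `obj1` and `e` only in `obj2`. A minimal sketch of one way to avoid them, assuming a missing value should count as 0, is the `add` method with `fill_value`:

```python
import pandas as pd

# Same data as the cell above.
obj1 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])
obj2 = pd.Series([12, 9, 4, 0], index=["a", "b", "c", "e"])

# Plain + leaves NaN wherever a label exists in only one Series.
plain_sum = obj1 + obj2

# add() with fill_value=0 substitutes 0 for the side that lacks the label.
filled_sum = obj1.add(obj2, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Whether 0 is a sensible substitute depends on the data; for some quantities leaving the NaN in place is the more honest result.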

In [112]:
np.exp(obj2) # applies the exponential (e**x) to each element and keeps the index

a    162754.791419
b      8103.083928
c        54.598150
e         1.000000
dtype: float64

In [113]:
# dictionary to series 

data={'India':1,'China':2,'US':3} 
obj3=pd.Series(data)

# passing an index together with the dictionary selects and orders the entries by label

new_index=['India','Uttarakhand','Gujrat'] # the dictionary has no value for Uttarakhand or Gujrat, so they show NaN
obj3=pd.Series(data,index=new_index) # note: assigning obj3.index=new_index instead would only relabel the existing values, not look them up
obj3.to_dict() # converts the Series to a dictionary

{'India': 1.0, 'Uttarakhand': nan, 'Gujrat': nan}
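
Passing `index=` to the constructor and assigning to `.index` afterwards are easy to confuse, so here is a small sketch of the difference using the same dictionary. The variable names `aligned` and `relabelled` are only illustrative.

```python
import pandas as pd

data = {"India": 1, "China": 2, "US": 3}
new_index = ["India", "Uttarakhand", "Gujrat"]

# index= looks each label up in the dictionary; unknown labels become NaN.
aligned = pd.Series(data, index=new_index)

# Assigning to .index keeps the values and only renames the labels by position.
relabelled = pd.Series(data)
relabelled.index = new_index

print(aligned.isna())  # True marks the labels that had no value
print(relabelled)
```

The first form is a lookup by label; the second is a rename, so it silently attaches China's and the US's values to the new labels.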

In [115]:
# both the Series and its index can be given a name

obj4=pd.Series({"BMW":1,"Mercedes":2,"Pagani":3})
obj4.name="Ranking" # name of the Series itself
obj4.index.name="Car brands" # name of the index
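
One place these names become visible is when the Series is turned into a table. The sketch below uses `reset_index()`, which moves the index into a regular column; the variable name `table` is just for illustration.

```python
import pandas as pd

obj4 = pd.Series({"BMW": 1, "Mercedes": 2, "Pagani": 3})
obj4.name = "Ranking"
obj4.index.name = "Car brands"

# The index name and the Series name become the two column headers.
table = obj4.reset_index()
print(table)
```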

## DataFrame

#### A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index

#### There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays:


In [116]:
# Difference between Series and DataFrame:

# Concept           Series                              DataFrame
# Definition        A one-dimensional labeled array     A two-dimensional labeled data structure (like a table)
# Analogy           Like one column in Excel            Like the whole Excel sheet (rows + columns)
# Data structure    1D                                  2D (rows + columns)
# Dtype             Only one dtype                      Each column has its own dtype
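
The rows of this comparison can be checked directly. A minimal sketch with a small two-row frame made up for the purpose:

```python
import pandas as pd

# Small illustrative frame; the values are placeholders.
df = pd.DataFrame({
    "states": ["Uttarakhand", "Punjab"],
    "year": [2009, 2011],
    "pop": [1.7, 2.9],
})

column = df["year"]  # selecting one column gives a Series

print(type(column).__name__, column.ndim)  # one-dimensional
print(type(df).__name__, df.ndim)          # two-dimensional
print(column.dtype)                        # a Series has a single dtype
print(df.dtypes)                           # a DataFrame has one dtype per column
```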

In [117]:
data={"states":["Jammu and Kashmir","Uttarakhand","Himachal Pradesh","Punjab"],"year":[2008,2009,2010,2011],"pop":[1.5,1.7,2.6,2.9]}
df=pd.DataFrame(data)
df

Unnamed: 0,states,year,pop
0,Jammu and Kashmir,2008,1.5
1,Uttarakhand,2009,1.7
2,Himachal Pradesh,2010,2.6
3,Punjab,2011,2.9


In [118]:
# when working with a large DataFrame we often want to look at only a few rows

df.head() # displays the first 5 rows by default
df.tail() # displays the last 5 rows by default (df has only 4 rows, so all of them are shown)

Unnamed: 0,states,year,pop
0,Jammu and Kashmir,2008,1.5
1,Uttarakhand,2009,1.7
2,Himachal Pradesh,2010,2.6
3,Punjab,2011,2.9
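
Both methods accept the number of rows as an argument, which is easier to see on a frame this small. A short sketch with the same data:

```python
import pandas as pd

data = {
    "states": ["Jammu and Kashmir", "Uttarakhand", "Himachal Pradesh", "Punjab"],
    "year": [2008, 2009, 2010, 2011],
    "pop": [1.5, 1.7, 2.6, 2.9],
}
df = pd.DataFrame(data)

first_two = df.head(2)  # rows at positions 0 and 1
last_one = df.tail(1)   # only the final row

print(first_two)
print(last_one)
```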


In [119]:
# Specifying the order of columns

pd.DataFrame(data,columns=["year","states","pop","debt"]) # "debt" is not a key in data, so that column is filled with NaN

# to list the column labels

df.columns

# A column in a DataFrame can be retrieved as a Series either by dictionary-like notation or by using the dot attribute notation :

df["states"]
df.year

# df[column] works for any column name, but df.column works only when the name is a valid Python identifier and does not clash with a DataFrame attribute or method. A name containing whitespace or symbols other than underscores cannot be accessed with dot notation, and df.pop returns the DataFrame.pop method rather than the "pop" column.

0    2008
1    2009
2    2010
3    2011
Name: year, dtype: int64
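
The `pop` column in this notebook happens to illustrate the name clash. The sketch below also adds a hypothetical `growth rate` column (not part of the original data) to show a name that dot notation cannot express at all.

```python
import pandas as pd

df = pd.DataFrame({
    "states": ["Jammu and Kashmir", "Uttarakhand"],
    "year": [2008, 2009],
    "pop": [1.5, 1.7],
    "growth rate": [0.1, 0.2],  # hypothetical column, for illustration only
})

# Bracket notation always returns the column.
pop_column = df["pop"]
rate_column = df["growth rate"]

# Dot notation finds the DataFrame.pop method first, not the column.
print(callable(df.pop))
print(pop_column.tolist())
```

Writing `df.growth rate` would be a syntax error, so bracket notation is the safer habit for anything beyond quick interactive exploration.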

### loc and iloc

#### loc and iloc are the two main pandas indexers for selecting data. Both access rows and columns of a DataFrame, but they differ in how they identify them:

### 1. loc[]: label-based indexing. Uses row and column labels (index values and column names). Slices include the end point (inclusive).
### 2. iloc[]: position-based indexing. Uses 0-based integer positions, like list indices. Slices exclude the end point (exclusive).
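
The inclusive versus exclusive slicing rule is the part that most often causes surprises. A minimal sketch, using a frame with the labels "A", "B", "C":

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "States": ["Uttarakhand", "Maharashtra", "Sikkim"],
        "population(millions)": [1442, 69.3, 144],
    },
    index=["A", "B", "C"],
)

by_label = df1.loc["A":"B"]  # label slice: the end label "B" is included
by_position = df1.iloc[0:1]  # position slice: the end position 1 is excluded

print(by_label)
print(by_position)
```

Here the label slice returns two rows and the position slice returns one, even though both look like "from the first row to the second".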

In [None]:
df1=pd.DataFrame({"States":["Uttarakhand","Maharashtra","Sikkim"],"Countries":["India","United States","Russia"],"population(millions)":[1442,69.3,144]},index=["A","B","C"])

df1.loc["C"] # the row whose index label is "C"

df1.loc["C","Countries"] # a single value: row "C", column "Countries"

df1.loc[["A","B"],["Countries","population(millions)"]] # selecting rows and columns simultaneously

df1.iloc[1] # the row at integer position 1 (the second row, labelled "B")

# select the rows whose population exceeds 100 million (last expression in the cell, so it is the one displayed)

df1[df1["population(millions)"]>100]

Unnamed: 0,States,Countries,population(millions)
A,Uttarakhand,India,1442.0
C,Sikkim,Russia,144.0


In [None]:
# Columns can be modified by assignment

df1["population(millions)"]=np.arange(100.,190,30) # the array length must match the number of rows; a scalar would be broadcast to every row
df1

Unnamed: 0,States,Countries,population(millions)
A,Uttarakhand,India,100.0
B,Maharashtra,United States,130.0
C,Sikkim,Russia,160.0


In [None]:
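# Assigning a Series aligns it on the index. None of the labels "two", "four", "five" exist in df1, so the whole column becomes NaN
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])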
df1["population(millions)"]=val
df1

Unnamed: 0,States,Countries,population(millions)
A,Uttarakhand,India,
B,Maharashtra,United States,
C,Sikkim,Russia,
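
The column is empty because assignment from a Series matches on index labels, not on position. A sketch of the partially matching case, with a hypothetical `value` column and a Series that covers only "A" and "C":

```python
import pandas as pd

df1 = pd.DataFrame(
    {"States": ["Uttarakhand", "Maharashtra", "Sikkim"]},
    index=["A", "B", "C"],
)

# Only labels "A" and "C" are present in the Series being assigned.
val = pd.Series([-1.2, -1.5], index=["A", "C"])
df1["value"] = val  # "value" is an illustrative column name

print(df1)
```

Rows whose label is found receive the corresponding value; row "B" has no match and is left as NaN.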


In [None]:
# add a Boolean column that is True where the Countries column equals "India"

df1["is_india"]=df1["Countries"]=="India"
df1

Unnamed: 0,States,Countries,population(millions),is_india
A,Uttarakhand,India,,True
B,Maharashtra,United States,,False
C,Sikkim,Russia,,False


In [None]:
# deleting a column: drop() vs del

# Feature                   drop()                                       del
# Type                      pandas method                                Python statement
# Works on                  Rows and columns                             Columns only
# Modifies in place?        Optional (inplace=True)                      Always in place
# Returns a new object?     Yes (by default)                             No (returns nothing)
# Can delete multiple?      Yes (e.g. df.drop(['A', 'B'], axis=1))       No (one at a time)
# Typical usage             Flexible, safe deletion                      Quick, permanent deletion of a column

# either of these removes the "is_india" column added above:
# df1=df1.drop("is_india",axis=1) # drop() returns a new DataFrame; with inplace=True it returns None, so do not reassign the result
# del df1["is_india"]

df1=pd.DataFrame(df) # rebuild df1 from the states DataFrame df for the cells that follow
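
The two approaches can be compared side by side. This is a small sketch on a reduced version of the frame; `dropped` and `still_there` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Countries": ["India", "United States", "Russia"]},
    index=["A", "B", "C"],
)
df1["is_india"] = df1["Countries"] == "India"

# drop() returns a new DataFrame and leaves df1 untouched by default.
dropped = df1.drop(columns="is_india")
still_there = "is_india" in df1.columns

# del removes the column from df1 itself.
del df1["is_india"]

print(list(dropped.columns), still_there, list(df1.columns))
```

Using `drop(columns=...)` is equivalent to `drop(..., axis=1)` and tends to read more clearly.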

In [None]:
# Transpose (swap rows and columns)

# You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

df1.T

Unnamed: 0,0,1,2,3
states,Jammu and Kashmir,Uttarakhand,Himachal Pradesh,Punjab
year,2008,2009,2010,2011
pop,1.5,1.7,2.6,2.9


In [None]:
# Possible data inputs to the DataFrame constructor

# 2D ndarray    -    A matrix of data, passing optional row and column labels

# Dictionary of arrays, lists, or tuples     -    Each sequence becomes a column in the DataFrame; all sequences must be the same length

# NumPy structured/record array    -    Treated as the “dictionary of arrays” case

# Dictionary of Series    -    Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed 

# Dictionary of dictionaries    -    Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case

# List of dictionaries or Series    -    Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels

# List of lists or tuples    -    Treated as the “2D ndarray” case

# Another DataFrame    -    The DataFrame’s indexes are used unless different ones are passed

# NumPy MaskedArray    -    Like the “2D ndarray” case except masked values are missing in the DataFrame result
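
Two of the cases above, sketched with small made-up inputs: a list of dictionaries (each item becomes a row) and a dictionary of Series (each value becomes a column, with the indexes unioned).

```python
import pandas as pd

# List of dictionaries: keys become column labels, missing keys become NaN.
rows = pd.DataFrame([{"a": 1, "b": 2}, {"a": 3, "c": 4}])

# Dictionary of Series: the row index is the union of both Series indexes.
cols = pd.DataFrame({
    "x": pd.Series([1, 2], index=["r1", "r2"]),
    "y": pd.Series([3, 4], index=["r2", "r3"]),
})

print(rows)
print(cols)
```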

In [None]:
d=np.array(df) # converts the DataFrame to a 2D NumPy array; the mixed column types give dtype=object
d

array([['Jammu and Kashmir', 2008, 1.5],
       ['Uttarakhand', 2009, 1.7],
       ['Himachal Pradesh', 2010, 2.6],
       ['Punjab', 2011, 2.9]], dtype=object)
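
The `dtype=object` in this output is a consequence of mixing strings and numbers in one array. A sketch of the same conversion with `DataFrame.to_numpy()`, comparing the full frame with a numeric-only selection:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "states": ["Punjab", "Uttarakhand"],
    "year": [2011, 2009],
    "pop": [2.9, 1.7],
})

mixed = df.to_numpy()                      # strings and numbers together
numeric = df[["year", "pop"]].to_numpy()   # int and float columns only

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```

Selecting only the numeric columns first lets NumPy use a common numeric dtype (float64 here), which is usually what numerical code expects.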

In [None]:
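# reindexing: reindex rearranges the data according to the new index, introducing missing values for any labels that were not already present

df2=pd.Series(["China","Russia","India","United states","Australia"],index=["D","C","A","E","B"])
# df has the default labels 0 to 3, so none of "A" to "E" match and every value in df3 is NaN
# (df2.reindex(["A","B","C","D","E"]) would instead reorder df2's existing values, because its labels match)
df3=df.reindex(["A","B","C","D","E"])
df3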

Unnamed: 0,states,year,pop
A,,,
B,,,
C,,,
D,,,
E,,,
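
For contrast with the all-NaN result, here is a sketch of `reindex` on the Series `df2`, whose labels do match. The `fill_value` argument and the extra label "F" are additions for illustration.

```python
import pandas as pd

df2 = pd.Series(
    ["China", "Russia", "India", "United states", "Australia"],
    index=["D", "C", "A", "E", "B"],
)

# Every requested label exists, so the values are simply reordered.
reordered = df2.reindex(["A", "B", "C", "D", "E"])

# "F" is not in df2; fill_value supplies a placeholder instead of NaN.
extended = df2.reindex(["A", "B", "F"], fill_value="unknown")

print(reordered)
print(extended)
```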


In [None]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3.reindex(np.arange(9), method="ffill") # forward fill: each new label takes the value of the closest preceding label

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
6    yellow
7    yellow
8    yellow
dtype: object
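
Forward fill is one of several fill options for `reindex`. The sketch below shows two variations on the same Series: backward fill, and forward fill restricted with `limit`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Backward fill: each new label takes the value of the next existing label.
back = obj3.reindex(np.arange(6), method="bfill")

# Forward fill, but propagate each value across at most one missing label.
limited = obj3.reindex(np.arange(9), method="ffill", limit=1)

print(back)
print(limited)
```

With backward fill, label 5 has nothing after it and stays missing; with `limit=1`, labels 6 to 8 stay missing because they are more than one step past the last known value.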