# 📘 Content from `01-Series.ipynb`


# Series

The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [None]:
pip install pandas seaborn

In [None]:
import pandas as pd
import numpy as np

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [None]:
labels = ['a','b','c']
my_list = [10,20,30]
d = {'a':10,'b':20,'c':30}
my_list

**Using Lists**

In [None]:
ser1 = pd.Series(data=my_list)
ser1

0    10
1    20
2    30
dtype: int64

In [None]:
type(ser1)

pandas.core.series.Series

In [None]:
pd.Series(data=my_list,index=['a','b','c'])

a    10
b    20
c    30
dtype: int64

In [None]:
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

**NumPy Arrays**

In [None]:
import numpy as np
arr = np.array(my_list)

pd.Series(arr)

0    10
1    20
2    30
dtype: int64

In [None]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int64

**Dictionary**

In [None]:
d = {'a':10,'b':20,'c':30}
pd.Series(d)

a    10
b    20
c    30
dtype: int64
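
When a dictionary is combined with an explicit `index`, pandas looks up each index label among the dictionary keys. The short sketch below illustrates this; the label `'z'` is made up for the example and has no matching key, so its value comes out as `NaN` and the dtype becomes float.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Labels are looked up in the dict; 'z' is a hypothetical label with no matching key
ser = pd.Series(d, index=['c', 'a', 'z'])
print(ser)
```

The order of the result follows the `index` argument, not the order of the dictionary.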

### Data in a Series

A pandas Series can hold a variety of object types:

In [None]:
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [None]:
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
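
The `dtype` shown at the bottom of each output is inferred from the data. As a rough illustration (the variable names here are only for this example), the sketch below compares an all-integer list, a list containing one float, and a list of mixed types.

```python
import pandas as pd

int_ser = pd.Series([10, 20, 30])        # all integers
float_ser = pd.Series([10, 20.5, 30])    # one float promotes every value to float
mixed_ser = pd.Series([10, 'b', 3.5])    # mixed types fall back to generic Python objects

for name, s in [('int', int_ser), ('float', float_ser), ('mixed', mixed_ser)]:
    print(name, s.dtype)
```

An `object` Series is flexible, but numeric methods generally work best on a numeric dtype.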

## Using an Index

The key to using a Series is understanding its index. Pandas uses these index labels (names or numbers) for fast lookups, much like a hash table or dictionary.

Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:

In [None]:
ser1 = pd.Series(data=[1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])                                   

In [None]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [None]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])                                   

In [None]:
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [None]:
ser1['USA']

np.int64(1)
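
Because the index behaves much like dictionary keys, a few dictionary-style operations also work on a Series. The sketch below assumes the same labels as above; `'France'` is a hypothetical label that is not in the index.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

print('USA' in ser)            # membership is tested against the index labels
print(ser.get('France', 0))    # .get returns a default instead of raising KeyError
print(ser[['USA', 'Japan']])   # a list of labels returns a smaller Series
```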

Operations between Series are also aligned on the index; a label that appears in only one Series produces NaN:

In [None]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
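
If the `NaN` values are not wanted, one option is the `add` method with `fill_value`. This is a sketch that assumes a missing label should count as 0, which may or may not suit your data.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Treat a label missing from one Series as 0 rather than NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```

The result is still float, since the alignment step introduces missing values before they are filled.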

In [None]:
ser1 = pd.Series([1,2,3,4,5],index = ['USA', 'Germany','USSR', 'Japan','India'])                                   

In [None]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
India      5
dtype: int64

In [None]:
print(ser1.India)
print(ser1['India'])

5
5


In [None]:
# Type `ser1.` and press Tab in Jupyter to browse the available attributes and methods

In [None]:
print(ser1.cumsum())


USA         1
Germany     3
USSR        6
Japan      10
India      15
dtype: int64


In [None]:
print(ser1.shape)

(5,)


In [None]:
print(ser1.min())
print(ser1.max())
print(ser1.median())
print(ser1.mode())

1
5
3.0
0    1
1    2
2    3
3    4
4    5
dtype: int64


In [None]:
ser1.dtype

dtype('int64')

In [None]:
ser2 = pd.Series(data = ['USA', 'Germany','USSR', 'Japan','India'],index = [1,2,3,4,5])
ser2

1        USA
2    Germany
3       USSR
4      Japan
5      India
dtype: object

In [None]:
ser2.dtype

dtype('O')

In [None]:
student_records = {'kiran':67,'kumar':89,'sandy':90,'sanjay':78, 'karthick':45}
stud_ser = pd.Series(student_records)

In [None]:
stud_ser

kiran       67
kumar       89
sandy       90
sanjay      78
karthick    45
dtype: int64

In [None]:
stud_ser['kiran']
stud_ser.sandy

np.int64(90)

In [None]:
stud_ser.mean()

np.float64(73.8)

In [None]:
stud_ser.argmax()

np.int64(2)
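
Note that `argmax` returns an integer position, not a label. To get the label itself, `idxmax` can be used, as in this small sketch built on the same student records:

```python
import pandas as pd

student_records = {'kiran': 67, 'kumar': 89, 'sandy': 90, 'sanjay': 78, 'karthick': 45}
stud_ser = pd.Series(student_records)

pos = stud_ser.argmax()     # integer position of the largest value
label = stud_ser.idxmax()   # index label of the largest value

print(pos, label)
print(stud_ser.iloc[pos] == stud_ser[label])
```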

# ---
# 📘 Content from `03-DataFrames.ipynb`
# ---

# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

In [None]:
import pandas as pd
import numpy as np

In [None]:
my_lst = [[1,2,3,4,5], [6,7,8,9,10]]

df = pd.DataFrame(my_lst, columns=['col1','col2','col3','col4', 'col5'])
df

Unnamed: 0,col1,col2,col3,col4,col5
0,1,2,3,4,5
1,6,7,8,9,10


In [None]:
my_dict = {"col1":[1,2,3,4,5], "col2":[6,7,8,9,10]}

pd.DataFrame(my_dict)

Unnamed: 0,col1,col2
0,1,6
1,2,7
2,3,8
3,4,9
4,5,10
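
To make the "bunch of Series sharing an index" idea concrete, here is a minimal sketch that builds a DataFrame from two Series. The names `s1`, `s2`, and `frame` and their values are invented for this example.

```python
import pandas as pd

# Two Series with partly different labels; 'c' and 'd' each appear in only one of them
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 40], index=['a', 'b', 'd'])

# Each dict key becomes a column; rows are aligned on the union of the labels
frame = pd.DataFrame({'col1': s1, 'col2': s2})
print(frame)
```

Labels present in only one Series get `NaN` in the other column, the same alignment behaviour seen when adding two Series.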


In [None]:
from numpy.random import randn
np.random.seed(101)

In [None]:
'W X Y Z'.split()

['W', 'X', 'Y', 'Z']

In [None]:
randn(5,4)

array([[ 2.70684984,  0.62813271,  0.90796945,  0.50382575],
       [ 0.65111795, -0.31931804, -0.84807698,  0.60596535],
       [-2.01816824,  0.74012206,  0.52881349, -0.58900053],
       [ 0.18869531, -0.75887206, -0.93323722,  0.95505651],
       [ 0.19079432,  1.97875732,  2.60596728,  0.68350889]])

In [None]:
df = pd.DataFrame(randn(5,4),index=['A', 'B', 'C', 'D', 'E'],columns=['W', 'X', 'Y', 'Z'])

In [None]:
subjects = ['Math', 'Physics', 'Chemistry', 'Biology', 'English']
scores = np.random.randint(50, 101, size=(10, len(subjects)))
student_df = pd.DataFrame(data=scores, columns=subjects)

In [None]:
names = ["Aarav", "Vihaan", "Arjun", "Vivaan", "Aditya", "Rohan", "Karan", "Ishaan", "Sai", "Vikram"]
student_df.index = names
student_df

In [None]:
student_df

Unnamed: 0,Math,Physics,Chemistry,Biology,English
Aarav,92,82,93,100,89
Vihaan,95,73,72,71,76
Arjun,62,52,90,67,51
Vivaan,75,52,66,100,52
Aditya,74,65,79,56,78
Rohan,92,72,98,70,75
Karan,83,96,57,58,66
Ishaan,79,50,94,90,89
Sai,66,88,95,89,89
Vikram,85,75,81,61,58


In [None]:
df['W']
# df.W

A    0.302665
B   -0.134841
C    0.807706
D   -0.497104
E   -0.116773
Name: W, dtype: float64

## Selection and Indexing

Let's learn the various ways to grab data from a DataFrame.

In [None]:
# A bare tuple is treated as a single column key, so this raises KeyError
df['W','Y']

KeyError: ('W', 'Y')

In [None]:
# Pass a list of column names
df[['W','Z']] 

Unnamed: 0,W,Z
A,0.302665,-1.159119
B,-0.134841,0.184502
C,0.807706,0.329646
D,-0.497104,0.484752
E,-0.116773,1.996652


In [None]:
# Attribute (dot) access (NOT RECOMMENDED: it can clash with DataFrame method names)
df.W

DataFrame columns are just Series:

In [None]:
type(df['W'])

pandas.core.series.Series

In [None]:
df['W'].dtype

dtype('float64')

In [None]:
df['W'] + df['Y']

A   -1.403420
B    0.032064
C    1.446493
D   -1.440510
E    0.121354
dtype: float64

**Creating a new column:**

In [None]:
df['new'] = df['W'] + df['Y']


In [None]:
df

Unnamed: 0,W,X,Y,Z,new
A,0.302665,1.693723,-1.706086,-1.159119,-1.40342
B,-0.134841,0.390528,0.166905,0.184502,0.032064
C,0.807706,0.07296,0.638787,0.329646,1.446493
D,-0.497104,-0.75407,-0.943406,0.484752,-1.44051
E,-0.116773,1.901755,0.238127,1.996652,0.121354


In [None]:
df['sub'] = 'python'
df

Unnamed: 0,W,X,Y,Z,new,sub
A,0.302665,1.693723,-1.706086,-1.159119,-1.40342,python
B,-0.134841,0.390528,0.166905,0.184502,0.032064,python
C,0.807706,0.07296,0.638787,0.329646,1.446493,python
D,-0.497104,-0.75407,-0.943406,0.484752,-1.44051,python
E,-0.116773,1.901755,0.238127,1.996652,0.121354,python
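
Assigning with `df['new'] = ...` modifies `df` itself. As an alternative sketch, `assign` returns a new DataFrame and leaves the original alone. The small frame below uses arbitrary stand-in values, not the random numbers shown above.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the values are arbitrary
df = pd.DataFrame(np.arange(6).reshape(3, 2), index=['A', 'B', 'C'], columns=['W', 'Y'])

# assign returns a new DataFrame and leaves df untouched
with_new = df.assign(new=df['W'] + df['Y'])

print(with_new)
print(list(df.columns))
```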


**Removing Columns**

In [None]:
# df.drop('new') behaves the same way, because axis defaults to 0 (rows)
df.drop('new', axis=0)

KeyError: "['new'] not found in axis"

In [None]:
df.drop('new', axis=1)  # axis defaults to 0 (rows); use axis=1 or axis='columns' to drop a column

Unnamed: 0,W,X,Y,Z,sub
A,0.302665,1.693723,-1.706086,-1.159119,python
B,-0.134841,0.390528,0.166905,0.184502,python
C,0.807706,0.07296,0.638787,0.329646,python
D,-0.497104,-0.75407,-0.943406,0.484752,python
E,-0.116773,1.901755,0.238127,1.996652,python


In [None]:
# Not inplace unless specified!
df

Unnamed: 0,W,X,Y,Z,new,sub
A,0.302665,1.693723,-1.706086,-1.159119,-1.40342,python
B,-0.134841,0.390528,0.166905,0.184502,0.032064,python
C,0.807706,0.07296,0.638787,0.329646,1.446493,python
D,-0.497104,-0.75407,-0.943406,0.484752,-1.44051,python
E,-0.116773,1.901755,0.238127,1.996652,0.121354,python


In [None]:
df.drop('new',axis=1,inplace=True)  # permanently drop it

In [None]:
df

Unnamed: 0,W,X,Y,Z,sub
A,0.302665,1.693723,-1.706086,-1.159119,python
B,-0.134841,0.390528,0.166905,0.184502,python
C,0.807706,0.07296,0.638787,0.329646,python
D,-0.497104,-0.75407,-0.943406,0.484752,python
E,-0.116773,1.901755,0.238127,1.996652,python


You can also drop rows this way:

In [None]:
df.drop('E', axis=0)  # axis 0 = rows, 1 = columns; returns a new DataFrame unless inplace=True

In [None]:
df

Unnamed: 0,W,X,Y,Z,sub
A,0.302665,1.693723,-1.706086,-1.159119,python
B,-0.134841,0.390528,0.166905,0.184502,python
C,0.807706,0.07296,0.638787,0.329646,python
D,-0.497104,-0.75407,-0.943406,0.484752,python
E,-0.116773,1.901755,0.238127,1.996652,python
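
The `axis` numbers can be easy to mix up. As a sketch of an alternative spelling, `drop` also accepts the `columns=` and `index=` keywords, and reassigning the result avoids `inplace=True`. The frame below is a small made-up example.

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6], 'new': [5, 7, 9]}, index=['A', 'B', 'C'])

no_new = df.drop(columns=['new'])   # same as df.drop('new', axis=1)
no_c = df.drop(index=['C'])         # same as df.drop('C', axis=0)

# Reassigning is an alternative to inplace=True
df = df.drop(columns=['new'])
print(df)
```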


In [None]:
df.loc[['B', 'C']][['X','Y']]

Unnamed: 0,X,Y
B,0.390528,0.166905
C,0.07296,0.638787


**Selecting Rows**

In [None]:
df['A']

KeyError: 'A'

In [None]:
df.loc['A']

W      0.302665
X      1.693723
Y     -1.706086
Z     -1.159119
sub      python
Name: A, dtype: object

In [None]:
df.loc[['A','B']]

Unnamed: 0,W,X,Y,Z,sub
A,0.302665,1.693723,-1.706086,-1.159119,python
B,-0.134841,0.390528,0.166905,0.184502,python


Or select based on position instead of label:

In [None]:
df

Unnamed: 0,W,X,Y,Z,sub
A,0.302665,1.693723,-1.706086,-1.159119,python
B,-0.134841,0.390528,0.166905,0.184502,python
C,0.807706,0.07296,0.638787,0.329646,python
D,-0.497104,-0.75407,-0.943406,0.484752,python
E,-0.116773,1.901755,0.238127,1.996652,python


In [None]:
df.iloc[0]  # integer position 0, i.e. the first row ('A')

W      0.302665
X      1.693723
Y     -1.706086
Z     -1.159119
sub      python
Name: A, dtype: object
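
One difference between `loc` and `iloc` that often surprises people shows up with slices. The sketch below uses a stand-in frame with the same row and column labels but arbitrary values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), index=list('ABCDE'), columns=list('WXYZ'))

by_label = df.loc['A':'C']     # label slice: includes the end label 'C'
by_position = df.iloc[0:2]     # position slice: excludes position 2, like a Python list

print(by_label.index.tolist())
print(by_position.index.tolist())
```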

**Selecting a subset of rows and columns**

In [None]:
df.loc['B','X']

np.float64(0.39052784273374097)

In [None]:
df.loc[['B','C'],['X','Y']]

Unnamed: 0,X,Y
B,0.390528,0.166905
C,0.07296,0.638787


In [None]:
df.iloc[[1,2],[1,2]]

Unnamed: 0,X,Y
B,0.390528,0.166905
C,0.07296,0.638787
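
The two selections above return the same block because rows `B`, `C` sit at positions 1, 2 and columns `X`, `Y` sit at positions 1, 2. The sketch below checks this on a stand-in frame with arbitrary values, and also shows `.at` and `.iat`, which fetch a single cell.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), index=list('ABCDE'), columns=list('WXYZ'))

by_label = df.loc[['B', 'C'], ['X', 'Y']]
by_position = df.iloc[[1, 2], [1, 2]]
print(by_label.equals(by_position))

# .at and .iat return a single cell by label or by position
print(df.at['B', 'X'], df.iat[1, 1])
```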
