# Data Analysis with Pandas

<h2>Pandas is a data analysis toolkit that handles many kinds of data, including missing data.</h2>

## Intro to data structures

<ul>
<li><h1>Series</h1></li>
</ul>
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

* syntax:

**import pandas as pd**

**s = pd.Series(data, index=index)**

Here, **data** can be many different things:

<ul>

<li>a Python data structure such as a list, tuple, or dict</li>

<li>an ndarray</li>

<li>a scalar value (like 5, or a string such as "Python")</li>
</ul>

The passed  **index**  is a list of axis labels.
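
Because the index labels the values one to one, its length has to match the length of list-like data. The following is a minimal sketch with made-up values (the labels `'x'`, `'y'`, `'z'` are chosen only for illustration) showing a matching index and a mismatched one being rejected:

```python
import pandas as pd

# Three values, three labels: each label is attached to one value
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(s.index.tolist())  # ['x', 'y', 'z']

# Three values, two labels: pandas raises ValueError
try:
    pd.Series([10, 20, 30], index=['x', 'y'])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```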

In [1]:
import pandas as pd
import numpy as np

In [2]:
# passing an ndarray (NumPy array) as data
s = pd.Series(np.random.rand(5))
s

0    0.488932
1    0.015735
2    0.263211
3    0.047290
4    0.958560
dtype: float64

In [2]:
s[1:4] # positional slice: the end position (4) is excluded

1    0.334976
2    0.597495
3    0.078041
dtype: float64

In [3]:
s = pd.Series(np.random.rand(5), index=['a','b','c','d','e'])
s

a    0.965017
b    0.473399
c    0.536005
d    0.135199
e    0.877168
dtype: float64

In [4]:
s['a':'d']

a    0.965017
b    0.473399
c    0.536005
d    0.135199
dtype: float64

In [5]:
s[::2]

a    0.965017
c    0.536005
e    0.877168
dtype: float64

In [2]:
s1 = pd.Series(np.random.rand(5), index=[1, 2, 3, 4, 5])
s1

1    0.693413
2    0.096268
3    0.145967
4    0.074061
5    0.547686
dtype: float64

In [5]:
s1[2 : 5]

3    0.145967
4    0.074061
5    0.547686
dtype: float64

In [3]:
s2 = pd.Series(np.random.rand(5), index=[10, 20, 30, 40, 50])
s2

10    0.228150
20    0.282059
30    0.553585
40    0.409423
50    0.216659
dtype: float64

In [4]:
s2[20:50]

Series([], dtype: float64)
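
The empty result comes from `[]` treating an integer slice as positions, even when the index itself holds integers: positions 20 to 50 do not exist in a five-element Series. A minimal sketch, assuming fixed illustrative values instead of random ones, that states the intent explicitly with `.loc` (label-based, end label included) and `.iloc` (position-based, end position excluded):

```python
import pandas as pd

# Fixed values (hypothetical) so the result is reproducible
s2 = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], index=[10, 20, 30, 40, 50])

by_label = s2.loc[20:50]     # label slice: both endpoints are included
by_position = s2.iloc[1:4]   # positional slice: end position is excluded

print(by_label.index.tolist())     # [20, 30, 40, 50]
print(by_position.index.tolist())  # [20, 30, 40]
```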

In [8]:
s2[10]

0.22814980181226852

In [9]:
#passing list as data
L = pd.Series([12,22,13,43,12])
L

0    12
1    22
2    13
3    43
4    12
dtype: int64

In [3]:
L1=pd.Series([12,22,13,43,12], index=[1,2,3,4,5])
L1

1    12
2    22
3    13
4    43
5    12
dtype: int64

In [10]:
# nested list: each inner list is stored as a single object element
L1=pd.Series([[12,22],[13,43],[12,[1,2]], 100,])
L1

0        [12, 22]
1        [13, 43]
2    [12, [1, 2]]
3             100
dtype: object

# From dict

In [11]:
# Passing a dictionary as data to Series
d = {'b': [1,2], 'a': [0,4],'c': [2,5]} # keys become the index of the series, values become the data
ss=pd.Series(d)
ss

b    [1, 2]
a    [0, 4]
c    [2, 5]
dtype: object

In [12]:
ss1=pd.Series(d, index=['b', 'c', 'd', 'a','e']) # labels missing from d ('d', 'e') get NaN (Not a Number)
ss1

b    [1, 2]
c    [2, 5]
d       NaN
a    [0, 4]
e       NaN
dtype: object

### Note

NaN (not a number) is the standard missing data marker used in pandas.
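
Missing entries can be detected and removed with `isna()` and `dropna()`. This is a minimal sketch that uses scalar dict values chosen for illustration rather than the lists used above:

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}  # illustrative scalar values
ss1 = pd.Series(d, index=['b', 'c', 'd', 'a', 'e'])

missing = ss1.isna()            # boolean Series: True where the value is NaN
n_missing = int(missing.sum())  # number of missing entries
cleaned = ss1.dropna()          # copy without the NaN entries

print(missing.tolist())        # [False, False, True, False, True]
print(n_missing)               # 2
print(cleaned.index.tolist())  # ['b', 'c', 'a']
```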

### If data is a scalar value, it is repeated to match the length of the index. Without an index, the result has a single element.

In [11]:
pd.Series(5)

0    5
dtype: int64

In [13]:
sc= pd.Series(5, index= ['a', 'b', 'c', 'd', 'e'])
sc

a    5
b    5
c    5
d    5
e    5
dtype: int64

In [11]:
# passing a string as data to a Series
ss3=pd.Series("Python",index= ['a', 'b', 'c', 'd', 'e']) # a string is also treated as a scalar value
ss3

a    Python
b    Python
c    Python
d    Python
e    Python
dtype: object

In [10]:
s = pd.Series(np.random.randint(5, 10, size=(5,)),index=['a', 'b', 'c', 'd', 'e'])
s

a    5
b    8
c    9
d    7
e    5
dtype: int32

### Performing Indexing on Series data

In [11]:
print("Accessing first element of series: ", s.iloc[0])
print()
print("Accessing first 3 elements of series:\n ",s.iloc[:3])
print()
print("Performing boolean indexing on series data where value > median",s[s > s.median()])
print()
print("Accessing elements at positions 4, 3, 1 of series: \n",s.iloc[[4, 3, 1]])
print()
print("Applying the exponential function to the series\n", np.exp(s))

Accessing first element of series:  5

Accessing first 3 elements of series:
  a    5
b    8
c    9
dtype: int32

Performing boolean indexing on series data where value > median b    8
c    9
dtype: int32

Accessing elements at positions 4, 3, 1 of series: 
 e    5
d    7
b    8
dtype: int32

Applying the exponential function to the series
 a     148.413159
b    2980.957987
c    8103.083928
d    1096.633158
e     148.413159
dtype: float64


In [12]:
# assigning a new value to an existing label
s['e'] = 12.0
s

a     5
b     8
c     9
d     7
e    12
dtype: int32

In [13]:
# adding a new label to the series
s['f']=23
s

a     5
b     8
c     9
d     7
e    12
f    23
dtype: int64

In [12]:
s4 = pd.Series(np.random.randint(5, 10, size=(5,)),index=['a', 'b', 'c', 'd', 'e'])
s4

a    5
b    9
c    9
d    7
e    5
dtype: int32

In [13]:
s4['f']='Hi'
s4  # adding a string upcasts the Series to dtype object, which can hold mixed Python objects

a     5
b     9
c     9
d     7
e     5
f    Hi
dtype: object

In [16]:
s4.dtype

dtype('O')

# DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
<ul>
<li>Dict of 1D ndarrays, lists, dicts, or Series</li>

<li>2-D numpy.ndarray</li>

<li>Structured or record ndarray</li>


<li>Another DataFrame</li>
</ul>

>**df=pd.DataFrame(data, index=None,columns=None)**
    
    
Along with the data, you can optionally pass <b>index (row labels) and columns (column labels)</b> arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
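
The discarding behaviour described above can be sketched with a dict of two Series and an explicit index. The values and labels here are illustrative assumptions, not data from the cells in this notebook:

```python
import pandas as pd

d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}

# Only the labels 'd', 'b', 'a' are kept; the data for 'c' is discarded.
# 'one' has no entry for 'd', so that cell becomes NaN.
df = pd.DataFrame(d, index=['d', 'b', 'a'])

print(df.index.tolist())          # ['d', 'b', 'a']
print(df.loc['d', 'two'])         # 4.0
print(df['one'].isna().tolist())  # [True, False, False]
```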

In [16]:
d = {'one': [1., 2., 3., 4.],
    'two': [4., 3., 2., 1.]}

# Passing a dict as data to DataFrame: keys become the column labels, values become the data

dd=pd.DataFrame(d, index=['a','b','c','d'], columns=['one','two','three'])

dd

|   | one | two | three |
|---|-----|-----|-------|
| a | 1.0 | 4.0 | NaN |
| b | 2.0 | 3.0 | NaN |
| c | 3.0 | 2.0 | NaN |
| d | 4.0 | 1.0 | NaN |


In [19]:
dr1={'a':[12],'b':[34],'c':[55]} #vector values for dict keys

# passing sequence (list) values as data
pd.DataFrame(dr1)

|   | a | b | c |
|---|---|---|---|
| 0 | 12 | 34 | 55 |


In [20]:
dr={'a':12, 'b':34, 'c':55} #scalar values for dict keys

pd.DataFrame(dr, index=[0]) # without index=[0] this raises "ValueError: If using all scalar values, you must pass an index"

|   | a | b | c |
|---|---|---|---|
| 0 | 12 | 34 | 55 |


In [8]:
dt1={'a':'Hello','b':'overs'} #string values are also considered as scalar values

dft=pd.DataFrame(dt1, index=[0]) # without index=[0] this raises "ValueError: If using all scalar values, you must pass an index"
dft

|   | a | b |
|---|---|---|
| 0 | Hello | overs |


In [21]:
dt={'a':('Hello',), 'b':('overs',)}

dft=pd.DataFrame(dt) #vector values
dft

|   | a | b |
|---|---|---|
| 0 | Hello | overs |


In [22]:
# passing a nested list as data; the short last row is padded with NaN
dt=pd.DataFrame([[10, 20], [30, 40], [50]], index=['a','b','c'], columns=['H1','B1'])
dt

|   | H1 | B1 |
|---|----|----|
| a | 10 | 20.0 |
| b | 30 | 40.0 |
| c | 50 | NaN |


In [26]:
#indexes:            0                          1
data2 = [{'a': 1, 'b': 2,'e':34}, {'a': 5, 'b': 10, 'c': 20}]

# NaN is a float, so columns that contain NaN ('e' and 'c') are upcast to float64

#list of dictionaries
d1=pd.DataFrame(data2)
d1

|   | a | b | e | c |
|---|---|---|---|---|
| 0 | 1 | 2 | 34.0 | NaN |
| 1 | 5 | 10 | NaN | 20.0 |


In [23]:
w=pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
w

|   | one | two |
|---|-----|-----|
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |


# Performing Slicing on DataFrame

In [10]:
#Accessing data of dataframe by using column labels

w['one']

a    1.0
b    2.0
c    3.0
d    4.0
Name: one, dtype: float64

In [11]:
# Trying to access DataFrame data by a row label

w['a'] # KeyError: [] with a single label looks up columns, not rows

KeyError: 'a'

In [44]:
w[0] # KeyError: an integer is also treated as a column label, not a row position

KeyError: 0
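
Rows can still be reached through the `.loc` (label-based) and `.iloc` (position-based) accessors. A minimal sketch that rebuilds a frame like `w` so the block is self-contained:

```python
import pandas as pd

w = pd.DataFrame({'one': [1., 2., 3., 4.],
                  'two': [4., 3., 2., 1.]},
                 index=['a', 'b', 'c', 'd'])

row_by_label = w.loc['a']      # row labelled 'a', returned as a Series
row_by_position = w.iloc[0]    # first row by position
two_rows = w.loc[['a', 'c']]   # a list of labels returns a DataFrame

print(row_by_label.tolist())     # [1.0, 4.0]
print(row_by_position.tolist())  # [1.0, 4.0]
print(two_rows.index.tolist())   # ['a', 'c']
```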

In [48]:
df=pd.DataFrame(np.random.randint(30,70, size=[6, 6]), columns=['a','b','c','d','e','f'])
df

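|   | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| 0 | 67 | 41 | 36 | 36 | 42 | 54 |
| 1 | 47 | 51 | 63 | 48 | 44 | 37 |
| 2 | 43 | 62 | 38 | 50 | 40 | 44 |
| 3 | 63 | 44 | 65 | 38 | 35 | 58 |
| 4 | 40 | 34 | 30 | 53 | 62 | 45 |
| 5 | 39 | 40 | 61 | 52 | 69 | 47 |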


In [49]:
df[['a','c','e']] #accessing with specific columns

|   | a | c | e |
|---|---|---|---|
| 0 | 67 | 36 | 42 |
| 1 | 47 | 63 | 44 |
| 2 | 43 | 38 | 40 |
| 3 | 63 | 65 | 35 |
| 4 | 40 | 30 | 62 |
| 5 | 39 | 61 | 69 |


In [50]:
df[[1, 3, 4]] # KeyError: a list inside [] selects column labels, not rows

KeyError: "None of [Int64Index([1, 3, 4], dtype='int64')] are in the [columns]"

In [51]:
# slicing and striding with [] operate on DataFrame rows
df[:6:2]

|   | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| 0 | 67 | 41 | 36 | 36 | 42 | 54 |
| 2 | 43 | 62 | 38 | 50 | 40 | 44 |
| 4 | 40 | 34 | 30 | 53 | 62 | 45 |
