## NumPy vs. Pandas

* Pandas is built on top of NumPy and provides a higher-level interface for data analysis, with more advanced data manipulation tools and data handling capabilities.
* NumPy provides multi-dimensional arrays, while **Pandas offers two primary data structures, Series (1D) and DataFrame (2D)**
* NumPy arrays are indexed by integer position, while Pandas provides more flexible indexing options, such as label-based as well as integer-based indexing
* **NumPy does not have built-in support for missing data, whereas Pandas provides tools for handling missing data, such as filling or dropping missing values**
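
The last two points can be sketched as follows. The values and the names `arr` and `ser` are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only.
arr = np.array([10.0, np.nan, 30.0])
first_by_position = arr[0]

# A Series carries labels, so the same data can be addressed by label.
ser = pd.Series(arr, index=["x", "y", "z"])
first_by_label = ser["x"]

# NaN propagates through a plain NumPy reduction...
numpy_sum = arr.sum()
# ...while pandas reductions skip missing values by default.
pandas_sum = ser.sum()

# pandas also offers helpers to fill or drop the missing entry.
filled = ser.fillna(0)
dropped = ser.dropna()

print(numpy_sum, pandas_sum)
```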

In [1]:
import pandas as pd
import numpy as np

## Series 

* In Pandas, a Series is a one-dimensional labeled array capable of holding any data type (integer, string, float, Python objects, etc.)
*  It is similar to a column in a spreadsheet or a SQL table.
* A Pandas Series consists of two parts: the data and the index
* The index is a sequence of labels that identify the elements of the Series. Labels are usually unique, but duplicates are allowed.
* The data can be a NumPy array, Python list, or any other array-like object.

### Different ways to create Series 

In [2]:
# Creating pandas series using list
# Series s has default integer index from 0 to 4 and contains the values 1, 2, 3, 4, 5.
data = [1,2,3,4,5]
s = pd.Series(data)
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [3]:
# Another way to create series using the 'List'
s = pd.Series([12,-4, 7,9]) 
s

0    12
1    -4
2     7
3     9
dtype: int64

In [4]:
# Here we have created the series using the array
arr = np.array([1,2,3,4,5])
s = pd.Series(arr)
print(s)                   

0    1
1    2
2    3
3    4
4    5
dtype: int32
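
Besides lists and arrays, a Series can also be built from a dict or from a single scalar. The sketch below uses invented data; the names `prices` and `zeros` are illustrative.

```python
import pandas as pd

# From a dict: keys become the index, values become the data.
prices = pd.Series({"apple": 3, "banana": 1, "cherry": 7})

# From a scalar: the value is repeated for every label in the index.
zeros = pd.Series(0, index=["a", "b", "c"])

print(prices.index.tolist())
print(zeros.tolist())
```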


#### Creating stacked series 

In [8]:
# Created a dataframe using the array
fram1 = pd.DataFrame(np.arange(9).reshape(3,3) , index= ['white','black','red'] , columns = ['ball', 'pen','pencil'])
fram1

       ball  pen  pencil
white     0    1       2
black     3    4       5
red       6    7       8


In [9]:
type(fram1) 
# Here we can notice that it is dataframe object

pandas.core.frame.DataFrame

In [10]:
# Created a stacked series using the dataframe
ser = fram1.stack()
ser

white  ball      0
       pen       1
       pencil    2
black  ball      3
       pen       4
       pencil    5
red    ball      6
       pen       7
       pencil    8
dtype: int32

In [11]:
# Here we can notice that it is the series object
type(ser) 

pandas.core.series.Series

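**Unstacking (pandas series -> pandas dataframe) a stacked series can be done in 2 ways, depending on which index level is moved to the columns:**
* 1 : the inner level (the original column labels) goes back to the columns, restoring the initial dataframe
* 0 : the outer level (the original row labels) goes to the columns, giving the transpose of the initial dataframe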

In [12]:
ser.unstack(1)

       ball  pen  pencil
white     0    1       2
black     3    4       5
red       6    7       8


In [13]:
# Here we have interchanged the index with the column labels
ser.unstack(0)

        white  black  red
ball        0      3    6
pen         1      4    7
pencil      2      5    8
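
The round trip between a DataFrame and a stacked Series can be sketched end to end as below. The names `frame`, `stacked`, `restored` and `transposed` are illustrative and not taken from the cells above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["white", "black", "red"],
    columns=["ball", "pen", "pencil"],
)
stacked = frame.stack()

# The stacked Series has a two-level index: level 0 holds the original
# row labels, level 1 holds the original column labels.
print(stacked.index.nlevels)

# Moving the inner level (1) to the columns restores the original layout.
restored = stacked.unstack(1)
# Moving the outer level (0) to the columns yields the transpose.
transposed = stacked.unstack(0)

print(restored.loc["black", "pen"], transposed.loc["pen", "black"])
```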


### Specifying the index of the series 

In [14]:
data = [1,2,3,4,5]
index = ['a','b', 'c','d','e']
s = pd.Series(data , index = index)
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [15]:
s['b']

2

* When a Series has a labelled index, you can use those labels to retrieve the corresponding values.

In [16]:
d = pd.Series([12, -4,7, 9], index = ['a', 'b' , 'c' ,'d']) # Another way to assign index
d

a    12
b    -4
c     7
d     9
dtype: int64

In [17]:
# With this command we checked whether the index of the series is unique or not
d.index.is_unique  

True

In [18]:
# Index of the series is not unique
serd = pd.Series(np.arange(7) , index= ['white', 'white', 'blue', 'green' , 'green', 'yellow' , 'yellow'])
serd , serd.index.is_unique

(white     0
 white     1
 blue      2
 green     3
 green     4
 yellow    5
 yellow    6
 dtype: int32,
 False)

In [19]:
ser = pd.Series([2,5,7,4], index = ['one', 'two' , 'three', 'four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

In [21]:
ser.reindex(['two','four','one','five','six'])

two     5.0
four    4.0
one     2.0
five    NaN
six     NaN
dtype: float64

In [22]:
ser

one      2
two      5
three    7
four     4
dtype: int64

* `reindex` returns a new Series; the original series 'ser' is not modified unless the result is assigned back
* By passing a list of index labels, we can select and reorder data from the series as we want
* If a label passed to `reindex` does not exist in the series, its value in the result is NaN, which is also why the dtype changes from int64 to float64
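
A small sketch of these points, assuming we would rather see 0 than NaN for the missing labels. `fill_value` is a standard `reindex` parameter; the names `picked` and `ser_reordered` are illustrative.

```python
import pandas as pd

ser = pd.Series([2, 5, 7, 4], index=["one", "two", "three", "four"])

# reindex returns a new Series; `ser` itself is left untouched.
# fill_value replaces the default NaN for labels that are absent from `ser`.
picked = ser.reindex(["two", "four", "five"], fill_value=0)
print(picked.tolist())

# To keep a reindexed result, assign it to a name.
ser_reordered = ser.reindex(["four", "three", "two", "one"])
print(ser_reordered.tolist())
```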

In [23]:
ser3 = pd.Series([1,5,6,3] , index = [0,3,5,6])
ser3

0    1
3    5
5    6
6    3
dtype: int64

In [24]:
# Forward fill
ser3.reindex(range(6) , method = 'ffill')

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

* It reindexes the ser3 Series using a new index that consists of the integers 0 through 5, and fills each new label with the most recent preceding value (forward fill).

In [25]:
# Backward filling
ser3.reindex(range(6) , method = 'bfill')

0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64

* This is called backward filling
* Each missing value in the Series is filled with the next available value that occurs after it.
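
Filling can also be bounded. The sketch below assumes we want a known value to be carried forward by at most one step; `limit` is a standard `reindex` parameter, and the name `limited` is illustrative.

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# limit=1 lets each known value be carried forward by at most one label,
# so label 2 (two steps after label 0) stays missing.
limited = ser3.reindex(range(6), method="ffill", limit=1)
print(limited)
```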

### Series data extracted based on index 

In [26]:
d

a    12
b    -4
c     7
d     9
dtype: int64

In [27]:
# Extracting values of the series
d.values 

array([12, -4,  7,  9], dtype=int64)

In [28]:
# Extracting index of the series
d.index 

Index(['a', 'b', 'c', 'd'], dtype='object')

In [29]:
# Value extraction by position (.iloc) and by label
print(d.iloc[2] , d['b']) 

7 -4


* When no index is given, a series gets the default integer index [0,1,2,3,...]. Once we assign custom labels to the index, we can extract data from the series in 2 ways: a) by position, using `.iloc` b) by the custom labels, using `[]` or `.loc`. Passing an integer position to plain `[]` on a label-indexed series, e.g. `d[2]`, is deprecated in recent pandas versions.
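
The two access styles can be written explicitly with `.loc` and `.iloc`. This is a minimal sketch on the same data as `d`; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

d = pd.Series([12, -4, 7, 9], index=["a", "b", "c", "d"])

by_label = d.loc["c"]      # label-based lookup
by_position = d.iloc[2]    # position-based lookup (third element)

print(by_label, by_position)
```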

In [30]:
# Data extraction based on the index
# When we pass the range then we get the series (with index) not just the values
d[0:2] 

a    12
b    -4
dtype: int64

* Through this we can select multiple elements from the series
* Here we need to pass the lower bound (inclusive) and the upper bound (non inclusive) of the default index. 

In [31]:
# Data extraction based on the labels of index
# Here we have passed a list of index values
d[['b', 'c']] 

b   -4
c    7
dtype: int64

* Through this we can extract multiple data from the series by specifying the list of labels 
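
One difference worth keeping in mind when selecting several elements: a positional slice excludes its upper bound, while a label slice includes it. A short sketch, with illustrative names `positional` and `labelled`:

```python
import pandas as pd

d = pd.Series([12, -4, 7, 9], index=["a", "b", "c", "d"])

positional = d.iloc[0:2]     # positions 0 and 1; the stop position is excluded
labelled = d.loc["a":"c"]    # labels a, b and c; the stop label is included

print(positional.index.tolist(), labelled.index.tolist())
```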

In [32]:
# Membership test on the data values
# isin returns a boolean series: True where the value belongs to the passed list
d.isin([0,10,12]) 

a     True
b    False
c    False
d    False
dtype: bool

In [33]:
# Data extraction based on the data values
d[d.isin([0,10,12])] 

a    12
dtype: int64

In [35]:
# Extracted all the null data points from the series
d[d.isnull()] 

Series([], dtype: int64)

In [42]:
e = d.copy()
e

a    12.0
b    -4.0
c     7.0
d     9.0
dtype: float64

In [43]:
e.iloc[0] = np.nan  # set the first element (by position) to NaN

In [44]:
e

a    NaN
b   -4.0
c    7.0
d    9.0
dtype: float64

In [46]:
e[e.isnull()]

a   NaN
dtype: float64

In [47]:
# Extracted all the not null data points
e[e.notnull()] 

b   -4.0
c    7.0
d    9.0
dtype: float64

In [48]:
# Filtering by index label: with a duplicated label we get every matching element of the series
serd['white'] 

white    0
white    1
dtype: int32

### Series data extraction based on 'Condition' 

In [49]:
# Filtering based on data values
d > 0 

a     True
b    False
c     True
d     True
dtype: bool

* This returns a boolean series with the same index: each element is "True" if the data satisfies the condition, otherwise "False"

In [50]:
# It returns only those data points which satisfy the condition
d[d>0] 

a    12.0
c     7.0
d     9.0
dtype: float64

* Here we are filtering based on the data values
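
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on the original integer values of `d`, with illustrative result names:

```python
import pandas as pd

d = pd.Series([12, -4, 7, 9], index=["a", "b", "c", "d"])

in_range = d[(d > 0) & (d < 10)]      # positive and below 10
outside = d[(d < 0) | (d > 10)]       # negative or above 10
not_positive = d[~(d > 0)]            # negation of the condition

print(in_range.index.tolist(), outside.index.tolist(), not_positive.index.tolist())
```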

### Modify series data : Add , update or remove elements from the series

#### Add data to the series 

In [51]:
d

a    12.0
b    -4.0
c     7.0
d     9.0
dtype: float64

In [52]:
# Here we have added another data point in the series which have index value = 'e' & data value = 16
d['e'] = 16
d     

a    12.0
b    -4.0
c     7.0
d     9.0
e    16.0
dtype: float64

#### Update data in the series 

In [53]:
d

a    12.0
b    -4.0
c     7.0
d     9.0
e    16.0
dtype: float64

In [54]:
# Here we change the second element of the series (selected by position) from -4 to 0
d.iloc[1] = 0 
print(d)

a    12.0
b     0.0
c     7.0
d     9.0
e    16.0
dtype: float64


#### Remove data from the series 

In [55]:
ser3 = pd.Series([1,5,6,3] , index = [0,3,5,6])
ser3

0    1
3    5
5    6
6    3
dtype: int64

In [56]:
# Delete a value by its index label; drop returns a new series and leaves ser3 unchanged
ser3.drop(3) 

0    1
5    6
6    3
dtype: int64

In [57]:
# We can also pass a list of index values, corresponding to which the data is dropped from the series
ser3.drop([3,5]) 

0    1
6    3
dtype: int64
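
A short sketch of how `drop` behaves: it matches labels rather than positions, and it returns a new Series. The `errors="ignore"` option is a standard parameter; the names `trimmed` and `safe` are illustrative.

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# 3 is matched against the labels, so the element labelled 3 is removed,
# not the element at position 3.
trimmed = ser3.drop(3)
print(len(ser3), len(trimmed))

# errors="ignore" silently skips labels that are not present (here 99).
safe = ser3.drop([3, 99], errors="ignore")
print(safe.index.tolist())
```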

## Operations and mathematical functions applied on series 

* Arithmetic operators (+, -, *, /) and other mathematical functions that apply to NumPy arrays can be applied to series as well

In [58]:
d

a    12.0
b     0.0
c     7.0
d     9.0
e    16.0
dtype: float64

In [59]:
# Every data point in the series 'd' is divided by 2
d/2 

a    6.0
b    0.0
c    3.5
d    4.5
e    8.0
dtype: float64

In [61]:
# NumPy functions (like log) can be applied to a series as well; note that log(0) gives -inf
np.log(d) 

a    2.484907
b        -inf
c    1.945910
d    2.197225
e    2.772589
dtype: float64

In [62]:
ser = pd.Series([10,20,30,40,50])
ser

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [63]:
# Index corresponding to the maximum value of the series
max_index = ser.idxmax()
max_index

4

* idxmax : 'index maximum'
* It gives the index of the maximum value in the series

In [64]:
min_index = ser.idxmin()
min_index

0
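
With the default integer index, the label returned by `idxmax` happens to equal the position. The sketch below uses a made-up, string-labelled Series (`scores`) to show the difference between `idxmax` (label) and `argmax` (position).

```python
import pandas as pd

scores = pd.Series([10, 50, 30], index=["x", "y", "z"])

label_of_max = scores.idxmax()      # index label of the maximum
position_of_max = scores.argmax()   # integer position of the maximum

print(label_of_max, position_of_max)
```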

## Evaluation of Series Values

In [66]:
serd = pd.Series([1,0,2,1,2,3] , index = ['white', 'white' , 'blue', 'green' , 'green' , 'yellow'])
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

In [67]:
# to see unique values from the series
serd.unique() 

array([1, 0, 2, 3], dtype=int64)

In [68]:
# To see the frequency of various different values in the series
serd.value_counts() 

1    2
2    2
0    1
3    1
Name: count, dtype: int64

## Operations between series 

### Addition, subtraction etc. between 2 series

In [69]:
mydict2 = {'red' : 400, 'yellow' : 1000 , 'black' : 700}
mydict1 = {'red' : 2000, 'blue' : 1000 , 'yellow' : 500, 'orange': 1000}

series1 = pd.Series(mydict1)
series2 = pd.Series(mydict2)

print(series1 + series2)

black        NaN
blue         NaN
orange       NaN
red       2400.0
yellow    1500.0
dtype: float64


**Here we can observe the following things:**
1) The resulting series is indexed by the union of the indexes of both series
2) For labels present in both series, the corresponding values are added; for labels present in only one series, the result is NaN
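
If NaN is not wanted for labels that exist in only one series, the method form `add` accepts a `fill_value`. This is a sketch of one possible choice (treating a missing entry as 0), not the only reasonable one; the name `total` is illustrative.

```python
import pandas as pd

series1 = pd.Series({"red": 2000, "blue": 1000, "yellow": 500, "orange": 1000})
series2 = pd.Series({"red": 400, "yellow": 1000, "black": 700})

# A label missing from one side is treated as 0 before adding.
total = series1.add(series2, fill_value=0)
print(total)
```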


### Merging 2 series 

#### Concatenation of 2 series 

In [70]:
ser1 = pd.Series(np.random.rand(4) , index =[1,2,3,4])
ser2 = pd.Series(np.random.rand(4), index = [5,6,7,8])
ser1 , ser2

(1    0.469971
 2    0.679486
 3    0.615407
 4    0.570389
 dtype: float64,
 5    0.246939
 6    0.960072
 7    0.541116
 8    0.138957
 dtype: float64)

In [71]:
# It is like both series are vertically stacked
pd.concat([ser1,ser2])

1    0.469971
2    0.679486
3    0.615407
4    0.570389
5    0.246939
6    0.960072
7    0.541116
8    0.138957
dtype: float64

In [72]:
ser3 = pd.Series(np.random.rand(4))
ser4 = pd.Series(np.random.rand(4))
ser3, ser4

(0    0.731608
 1    0.340060
 2    0.845082
 3    0.944695
 dtype: float64,
 0    0.289313
 1    0.245228
 2    0.171814
 3    0.366750
 dtype: float64)

In [73]:
pd.concat([ser3,ser4])

0    0.731608
1    0.340060
2    0.845082
3    0.944695
0    0.289313
1    0.245228
2    0.171814
3    0.366750
dtype: float64

* Here we can see that `pd.concat` just appended the data and kept the index values of both series, so the labels 0 to 3 now appear twice
* By default it appends the data vertically
* We can control whether the series are stacked vertically or horizontally with the `axis` argument
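
When the repeated labels are unwanted, `ignore_index=True` discards the original labels and renumbers the result. A sketch with fixed values instead of random ones; the names `stacked` and `renumbered` are illustrative.

```python
import pandas as pd

ser3 = pd.Series([0.1, 0.2])
ser4 = pd.Series([0.3, 0.4])

stacked = pd.concat([ser3, ser4])                        # original labels are kept
renumbered = pd.concat([ser3, ser4], ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist(), renumbered.index.tolist())
```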

In [74]:
# Vertically stacked, rows increased
# The resulting object is still a series
pd.concat([ser1,ser2], axis=0)

1    0.469971
2    0.679486
3    0.615407
4    0.570389
5    0.246939
6    0.960072
7    0.541116
8    0.138957
dtype: float64

In [75]:
# Horizontally stacked, new column added 
# The resulting object is a dataframe
pd.concat([ser1,ser2], axis =1) 

          0         1
1  0.469971       NaN
2  0.679486       NaN
3  0.615407       NaN
4  0.570389       NaN
5       NaN  0.246939
6       NaN  0.960072
7       NaN  0.541116
8       NaN  0.138957


* Here we have appended the data horizontally 

In [76]:
pd.concat([ser3,ser4], axis =1 )  

          0         1
0  0.731608  0.289313
1  0.340060  0.245228
2  0.845082  0.171814
3  0.944695  0.366750


* No NaN values appear here because both series have the same index
* The column names 0 & 1 are assigned by default because no keys were passed

In [77]:
# Here we specified the column names
pd.concat([ser3,ser4], axis =1 , keys=['ser3', 'ser4'])  

       ser3      ser4
0  0.731608  0.289313
1  0.340060  0.245228
2  0.845082  0.171814
3  0.944695  0.366750


In [78]:
type(pd.concat([ser3,ser4], axis =1 , keys=['ser3', 'ser4']))

pandas.core.frame.DataFrame

* `pd.concat()` takes a `keys` argument instead of `columns` because `keys` is not only used for specifying column names when concatenating pandas series or dataframes; it serves the more general purpose of identifying the concatenated objects.
* With `axis=1` the keys become the column names; with `axis=0` they become the outer level of a hierarchical index on the resulting concatenated object
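
The hierarchical index is easiest to see with vertical concatenation. In this sketch the series `a` and `b` and the key names `"first"` and `"second"` are invented for illustration.

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([3, 4], index=["x", "y"])

# With axis=0 (the default), each key becomes the outer index level.
combined = pd.concat([a, b], keys=["first", "second"])
print(combined)

# The key selects the block that came from the corresponding input.
from_b = combined.loc["second"]
single_value = combined.loc[("second", "x")]
```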

#### Combining 2 series 

In [79]:
ser1 = pd.Series(np.random.rand(5) , index=[1,2,3,4,5])
ser2 = pd.Series(np.random.rand(4), index= [2,4,5,6])
ser1, ser2

(1    0.817486
 2    0.629421
 3    0.089682
 4    0.630006
 5    0.028815
 dtype: float64,
 2    0.935462
 4    0.369879
 5    0.106247
 6    0.741723
 dtype: float64)

In [80]:
ser1.combine_first(ser2)

1    0.817486
2    0.629421
3    0.089682
4    0.630006
5    0.028815
6    0.741723
dtype: float64

* Here we have combined the 2 series so that, for every index, we get the value if it is available in either series. If a non-null value is available in both series, the value is taken from the calling series 'ser1' 
* For example, data for index 2 is present in both series, but the combined result gives priority to 'ser1' 

In [81]:
ser2.combine_first(ser1)

1    0.817486
2    0.935462
3    0.089682
4    0.369879
5    0.106247
6    0.741723
dtype: float64

In [82]:
# combine_first also works on slices: here only the first three elements of each series are combined
ser1[:3].combine_first(ser2[:3]) 

1    0.817486
2    0.629421
3    0.089682
4    0.369879
5    0.106247
dtype: float64
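
Because the cells above use random numbers, the priority rule is easier to check on fixed values. A minimal sketch with made-up data; the names `first`, `second` and `merged` are illustrative.

```python
import numpy as np
import pandas as pd

first = pd.Series([1.0, np.nan, 3.0], index=[1, 2, 3])
second = pd.Series([20.0, 30.0, 40.0], index=[2, 3, 4])

merged = first.combine_first(second)
# index 1: only in `first`            -> 1.0
# index 2: NaN in `first`             -> taken from `second` (20.0)
# index 3: non-null in both           -> `first` wins (3.0)
# index 4: only in `second`           -> 40.0
print(merged)
```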

## Working with Null Values 

In [83]:
s2 = pd.Series([5,-3, np.nan, 14]) # This is how we create a series with null data (np.NaN was removed in NumPy 2.0)
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

### Understanding null values 

*  NaN (Not a Number) is a special value used to represent undefined or unrepresentable values in calculations.
*  By using NaN to represent undefined or unrepresentable values in calculations, Python and other programming languages provide a way to handle these cases without causing errors or unexpected behavior in a program.
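
One practical consequence is that NaN never compares equal to anything, including itself, so missing values have to be detected with `isna()` (or `isnull()`) rather than `==`. A short sketch with illustrative names:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)

s = pd.Series([5, np.nan, 14])

mask_wrong = s == np.nan    # element-wise comparison is False everywhere
mask_right = s.isna()       # True exactly where the value is missing

print(mask_wrong.tolist(), mask_right.tolist())
```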

The cells below show common scenarios in which NaN is created:

In [84]:
# 1 Undefined operations, e.g. the square root or the log of a negative number, or 0/0
# (dividing a nonzero number by zero or taking log(0) gives inf / -inf rather than NaN)
s = pd.Series([10,15,-1,15])
d = np.sqrt(s)
d

0    3.162278
1    3.872983
2         NaN
3    3.872983
dtype: float64

In [85]:
# 2 Missing or incomplete data
data = np.array([1,2,np.nan,4,5])
data

array([ 1.,  2., nan,  4.,  5.])

In [None]:
colors = ['red', 'yellow' ,'orange' , 'blue' , 'green']

### Replacing null values 

In [87]:
ser = pd.Series([1,3,np.nan,4,6,np.nan,3])
ser

0    1.0
1    3.0
2    NaN
3    4.0
4    6.0
5    NaN
6    3.0
dtype: float64

In [None]:
# Replaced null values from the pandas series with 0
ser.replace(np.nan , 0)
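
Besides `replace`, pandas has dedicated methods for missing data. The sketch below shows a few common options on the same values; which one is appropriate depends on the data, and the result names are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, np.nan, 4, 6, np.nan, 3])

zero_filled = ser.fillna(0)            # same result as ser.replace(np.nan, 0)
mean_filled = ser.fillna(ser.mean())   # mean() skips NaN: 17 / 5 = 3.4
carried = ser.ffill()                  # carry the previous value forward
dropped = ser.dropna()                 # remove the missing entries entirely

print(mean_filled.tolist())
```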

## Working with intervals 

### Create various intervals 

In [88]:
bins = [0,25, 50,75,100]
results = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])

In [91]:
intervals = pd.cut(results,bins)
intervals

0      (0, 25]
1      (0, 25]
2     (25, 50]
3     (25, 50]
4     (25, 50]
5     (50, 75]
6     (50, 75]
7    (75, 100]
8    (75, 100]
dtype: category
Categories (4, interval[int64, right]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]

In [92]:
print(intervals , type(intervals))

0      (0, 25]
1      (0, 25]
2     (25, 50]
3     (25, 50]
4     (25, 50]
5     (50, 75]
6     (50, 75]
7    (75, 100]
8    (75, 100]
dtype: category
Categories (4, interval[int64, right]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]] <class 'pandas.core.series.Series'>


* Here the series data has been converted into intervals based on the bins: each data point from the series is assigned to the bin (interval) that contains it 
* The result is a categorical series

In [93]:
# When an integer is passed instead of bin edges, pd.cut automatically creates that many (here 5) equal-width intervals spanning the data
pd.cut(results, 5)

0    (9.92, 26.0]
1    (9.92, 26.0]
2    (26.0, 42.0]
3    (26.0, 42.0]
4    (42.0, 58.0]
5    (58.0, 74.0]
6    (58.0, 74.0]
7    (74.0, 90.0]
8    (74.0, 90.0]
dtype: category
Categories (5, interval[float64, right]): [(9.92, 26.0] < (26.0, 42.0] < (42.0, 58.0] < (58.0, 74.0] < (74.0, 90.0]]

In [94]:
pd.qcut(results , 5)

0    (9.999, 26.0]
1    (9.999, 26.0]
2     (26.0, 42.0]
3     (26.0, 42.0]
4     (42.0, 58.0]
5     (58.0, 74.0]
6     (58.0, 74.0]
7     (74.0, 90.0]
8     (74.0, 90.0]
dtype: category
Categories (5, interval[float64, right]): [(9.999, 26.0] < (26.0, 42.0] < (42.0, 58.0] < (58.0, 74.0] < (74.0, 90.0]]

* The pd.qcut() and pd.cut() functions in pandas are used for different purposes and produce different results, even when given the same number of bins.
* pd.qcut() is used to split a dataset into quantile intervals, where each interval contains approximately the same number of observations. This means that the interval boundaries are chosen based on the distribution of the data, rather than being evenly spaced.
* pd.cut() is used to split a dataset into equally-spaced intervals or bins
* This means that the interval boundaries are determined based on the minimum and maximum values in the dataset, rather than being determined by the distribution of the data.
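
On evenly spread data such as `results` the two functions give almost the same bins, so the difference is easier to see on skewed data. A sketch with a made-up series containing one large outlier:

```python
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width bins: the outlier stretches the range, so almost
# everything lands in the first bin.
equal_width = pd.cut(skewed, 2).value_counts().sort_index()

# Quantile bins: the edges follow the data, so the counts are balanced.
equal_count = pd.qcut(skewed, 2).value_counts().sort_index()

print(equal_width.tolist(), equal_count.tolist())
```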

### Extracting various objects from the categorical series

In [95]:
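# It provides the distinct intervals under which the series data is classified
# (on a categorical series these attributes are reached through the .cat accessor)
intervals.cat.categories 

IntervalIndex([(0, 25], (25, 50], (50, 75], (75, 100]], dtype='interval[int64, right]')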

In [96]:
intervals.cat.codes

0    0
1    0
2    1
3    1
4    1
5    2
6    2
7    3
8    3
dtype: int8

Each interval is associated with a code in {0,1,2,3}, so each data point from the series is mapped to the code of the bin (interval) it falls into.

In [97]:
intervals.value_counts()  # the top-level pd.value_counts is deprecated in recent pandas versions

(25, 50]     3
(0, 25]      2
(50, 75]     2
(75, 100]    2
Name: count, dtype: int64

Here we can see how the values of the series are distributed across the intervals

#### Labelling each interval 

In [98]:
results

0    10
1    20
2    30
3    40
4    50
5    60
6    70
7    80
8    90
dtype: int64

In [99]:
bin_names = ["unlikely", "less likely", "likely", "very likely"]

In [100]:
# Here results is the initial series
pd.cut(results , bins , labels = bin_names) 

0       unlikely
1       unlikely
2    less likely
3    less likely
4    less likely
5         likely
6         likely
7    very likely
8    very likely
dtype: category
Categories (4, object): ['unlikely' < 'less likely' < 'likely' < 'very likely']
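
Values that fall outside the bins, or exactly on the open left edge of the first bin, are not assigned to any interval and become NaN. The sketch below uses made-up scores to show this, along with the standard `include_lowest` option of `pd.cut`.

```python
import pandas as pd

bins = [0, 25, 50, 75, 100]
scores = pd.Series([0, 25, 26, 100, 120])

# Intervals are open on the left by default, so 0 is not in (0, 25];
# 120 lies beyond the last edge. Both become NaN.
default = pd.cut(scores, bins)

# include_lowest=True closes the first interval on the left, so 0 is kept.
with_lowest = pd.cut(scores, bins, include_lowest=True)

print(default.isna().tolist())
print(with_lowest.isna().tolist())
```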