### AccelerateAI - Python for Data Science
Introduction to Python Language (Python 3)

In this notebook we will cover the following:
- Data Wrangling with Pandas 
- Advanced Numpy
- Strings and Text

#### 1. Data Wrangling

The data wrangling process consists of the following steps:
1. Discovering: Before diving in, we must understand what is in the data.
2. Structuring: Organizing the data for better analysis. A single column may turn into several rows, or one column may become two.
3. Cleaning: Deciding what to do with erroneous data, outliers, missing values, etc.
4. Enriching: Exploring what additional data might be added to help with the analysis.
5. Validating: Verifying data consistency, quality, and security.
6. Publishing: Making the data and insights easy to access and consume.

In [1]:
import numpy as np
import pandas as pd 

##### 1.1 About the Credit Card Dataset
- Credit Card customer acquisition
- Spend (Transaction) Data
- Payment Information 


In [2]:
customer_df = pd.read_excel("Credit Card Data.xlsx", sheet_name=0)
customer_df.head()

Unnamed: 0,No,Customer,Age,City,Product,Limit,Company,Segment
0,1,A1,47,BANGALORE,Gold,1500000,C1,Self Employed
1,2,A2,56,CALCUTTA,Silver,300000,C2,Salaried_MNC
2,3,A3,30,COCHIN,Platimum,540000,C3,Salaried_Pvt
3,4,A4,22,BOMBAY,Platimum,840084,C4,Govt
4,5,A5,59,BANGALORE,Platimum,420084,C5,Normal Salary


In [3]:
trans_df = pd.read_excel("Credit Card Data.xlsx", sheet_name=1)
trans_df.head()

Unnamed: 0,Sl No:,Customer,Month,Type,Amount
0,1,A1,2004-01-12,JEWELLERY,344054.980813
1,2,A1,2004-01-03,PETRO,935.495203
2,3,A1,2004-01-15,CLOTHES,8687.895474
3,4,A1,2004-01-25,FOOD,341.159711
4,5,A1,2005-01-17,CAMERA,3406.639477


In [4]:
payment_df = pd.read_excel("Credit Card Data.xlsx", sheet_name=2)
payment_df.head()

Unnamed: 0,SL No:,Customer,Month,Amount
0,1,A1,2006-05-15,230847.3
1,2,A1,2005-08-27,1835.124
2,3,A1,2004-03-07,4858.701
3,4,A1,2005-03-01,1360527.0
4,5,A1,2004-02-14,190232.2


In [5]:
# Any missing values?
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   No        100 non-null    int64 
 1   Customer  100 non-null    object
 2   Age       100 non-null    int64 
 3   City      100 non-null    object
 4   Product   100 non-null    object
 5   Limit     100 non-null    int64 
 6   Company   100 non-null    object
 7   Segment   100 non-null    object
dtypes: int64(3), object(5)
memory usage: 6.4+ KB


In [6]:
#Let's check if the data type is correct
customer_df.dtypes

No           int64
Customer    object
Age          int64
City        object
Product     object
Limit        int64
Company     object
Segment     object
dtype: object

In [7]:
customer_df.Age.describe()

count    100.000000
mean      46.460000
std       17.816925
min       14.000000
25%       30.000000
50%       47.500000
75%       62.250000
max       75.000000
Name: Age, dtype: float64

###### Age can't be less than 18. Where age is less than 18, let's replace it with the mean of the age values.

In [8]:
#df.loc[ condition, column]

customer_df.loc[(customer_df.Age < 18),'Age']=customer_df.Age.mean()
customer_df.Age.describe()

count    100.000000
mean      47.393800
std       16.946251
min       18.000000
25%       33.000000
50%       47.500000
75%       62.250000
max       75.000000
Name: Age, dtype: float64
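
The replacement above uses the mean of all ages, including the invalid ones. A possible variation, shown here as a minimal sketch on made-up data (`toy_df` and its values are assumptions, not part of the credit card dataset), computes the mean from the valid ages only before filling in the masked rows:

```python
import pandas as pd

# Toy data standing in for customer_df; the values are made up for illustration.
toy_df = pd.DataFrame({"Customer": ["A1", "A2", "A3", "A4"],
                       "Age": [14.0, 30.0, 50.0, 16.0]})

under_age = toy_df["Age"] < 18                      # boolean mask of invalid ages
valid_mean = toy_df.loc[~under_age, "Age"].mean()   # mean of valid ages only: (30 + 50) / 2
toy_df.loc[under_age, "Age"] = valid_mean           # overwrite only the masked rows

print(toy_df)
```

Whether the overall mean or the valid-only mean is the better fill value depends on the analysis; the notebook continues with the overall mean.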

In [9]:
#let's peek into the transactions
trans_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1505 entries, 0 to 1504
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Sl No:    1505 non-null   int64         
 1   Customer  1504 non-null   object        
 2   Month     1502 non-null   datetime64[ns]
 3   Type      1505 non-null   object        
 4   Amount    1502 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 58.9+ KB


In [10]:
# Let's drop the rows with missing values
trans_df.dropna(how="any",axis=0)                                       
trans_df.info()                                            # What's wrong with this?? 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1505 entries, 0 to 1504
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Sl No:    1505 non-null   int64         
 1   Customer  1504 non-null   object        
 2   Month     1502 non-null   datetime64[ns]
 3   Type      1505 non-null   object        
 4   Amount    1502 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 58.9+ KB
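
The row count above is unchanged because `dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned or `inplace=True` is passed. A minimal sketch on made-up data (`toy_df` and `cleaned_df` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy data with one missing Customer and one missing Amount.
toy_df = pd.DataFrame({"Customer": ["A1", None, "A3"],
                       "Amount": [100.0, 200.0, np.nan]})

toy_df.dropna(how="any", axis=0)                 # result is discarded; toy_df is unchanged
print(len(toy_df))                               # 3

cleaned_df = toy_df.dropna(how="any", axis=0)    # keep the returned DataFrame
print(len(cleaned_df))                           # 1
```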


In [11]:
trans_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1505 entries, 0 to 1504
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Sl No:    1505 non-null   int64         
 1   Customer  1504 non-null   object        
 2   Month     1502 non-null   datetime64[ns]
 3   Type      1505 non-null   object        
 4   Amount    1502 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 58.9+ KB


In [12]:
trans_df.dtypes

Sl No:               int64
Customer            object
Month       datetime64[ns]
Type                object
Amount             float64
dtype: object

##### How many customers are there?

In [13]:
trans_df.Customer.nunique()

100

##### What are the spend categories? 

In [14]:
trans_df.Type.value_counts()

PETRO           200
CAMERA          161
FOOD            160
AIR TICKET      148
TRAIN TICKET    132
SHOPPING        114
BUS TICKET      101
CLOTHES          95
JEWELLERY        95
RENTAL           76
MOVIE TICKET     76
BIKE             49
AUTO             40
CAR              30
SANDALS          28
Name: Type, dtype: int64

##### Which category has the highest average spend?

In [15]:
trans_df.groupby("Type")['Amount'].mean().sort_values(ascending=False).round(2)

Type
CAR             409143.47
AIR TICKET      254632.19
JEWELLERY       239218.69
BIKE            210701.27
AUTO             27320.76
CLOTHES          25140.16
CAMERA           21420.54
RENTAL           13106.51
BUS TICKET       12559.79
SHOPPING          7394.33
SANDALS           2516.63
MOVIE TICKET      1875.30
TRAIN TICKET      1627.49
PETRO              549.48
FOOD               341.17
Name: Amount, dtype: float64

In [16]:
#Are air tickets this expensive??
trans_df[trans_df['Type'] == 'AIR TICKET'].head()

Unnamed: 0,Sl No:,Customer,Month,Type,Amount
10,11,A11,2005-02-23,AIR TICKET,195424.058828
29,30,A30,2006-05-10,AIR TICKET,73734.048787
40,41,A41,2005-05-27,AIR TICKET,125409.752876
55,56,A56,2006-02-03,AIR TICKET,237754.648092
70,71,A69,2005-01-17,AIR TICKET,167393.969987


In [17]:
trans_df[trans_df['Type'] == 'JEWELLERY'].head()

Unnamed: 0,Sl No:,Customer,Month,Type,Amount
0,1,A1,2004-01-12,JEWELLERY,344054.980813
15,16,A16,2006-03-23,JEWELLERY,98182.18094
19,20,A20,2006-04-30,JEWELLERY,348718.392649
45,46,A46,2005-06-01,JEWELLERY,322625.842279
60,61,A61,2006-07-03,JEWELLERY,167214.229458


##### What is the average monthly spend by product categories? 

In [18]:
# let's create an actual month column from the timestamp
trans_df.rename(columns={'Month':'TransDate'}, inplace=True)

In [19]:
trans_df['Month'] = trans_df['TransDate'].dt.month    # vectorized datetime accessor; missing dates give NaN
trans_df['Year'] = trans_df['TransDate'].dt.year

In [20]:
trans_df.sample(5)

Unnamed: 0,Sl No:,Customer,TransDate,Type,Amount,Month,Year
1256,1257,A43,2006-04-03,AIR TICKET,371439.859083,4.0,2006.0
874,875,A23,2004-01-15,JEWELLERY,319592.488998,1.0,2004.0
416,417,A26,2006-03-31,TRAIN TICKET,1898.985437,3.0,2006.0
309,310,A22,2006-04-02,SANDALS,2090.225383,4.0,2006.0
772,773,A37,2005-02-16,SHOPPING,1561.739434,2.0,2005.0


In [21]:
spend_df=trans_df.groupby(['Year','Month','Type'], as_index=False)['Amount'].mean().round(2)        #2 decimal places

In [22]:
spend_df.head()

Unnamed: 0,Year,Month,Type,Amount
0,2004.0,1.0,AIR TICKET,257735.68
1,2004.0,1.0,AUTO,16260.27
2,2004.0,1.0,BIKE,176136.06
3,2004.0,1.0,BUS TICKET,10181.61
4,2004.0,1.0,CAMERA,20912.67


In [23]:
# pivot the data - need multiple index so use pivot_table() instead of pivot()
spend_pivot = pd.pivot_table(spend_df, index=['Year','Month'], columns='Type', values='Amount')
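
One practical difference worth keeping in mind: `pivot_table()` aggregates duplicate index/column combinations (mean by default) and fills absent combinations with NaN. The sketch below uses made-up numbers (`toy_df` is an assumption for illustration) to show both effects:

```python
import pandas as pd

# Toy data: (2004, 1, FOOD) appears twice, and 2005 has no PETRO entry.
toy_df = pd.DataFrame({
    "Year":   [2004, 2004, 2004, 2005],
    "Month":  [1, 1, 1, 1],
    "Type":   ["FOOD", "FOOD", "PETRO", "FOOD"],
    "Amount": [300.0, 100.0, 500.0, 320.0],
})

# The default aggfunc is the mean, so the two FOOD rows collapse to 200.0.
toy_pivot = pd.pivot_table(toy_df, index=["Year", "Month"], columns="Type", values="Amount")
print(toy_pivot)
```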

In [24]:
spend_pivot.sample(5)

Year,Month,AIR TICKET,AUTO,BIKE,BUS TICKET,CAMERA,CAR,CLOTHES,FOOD,JEWELLERY,MOVIE TICKET,PETRO,RENTAL,SANDALS,SHOPPING,TRAIN TICKET
2004.0,2.0,214071.84,,222235.99,22386.4,,,21562.59,335.97,174222.09,,562.31,,,8453.49,
2004.0,3.0,455343.36,,,,14377.59,,18170.6,446.42,156577.2,,494.72,,2372.47,,1705.33
2005.0,9.0,289240.16,,,,32782.09,,,,278895.08,954.69,490.23,,2732.14,,
2005.0,10.0,213797.64,,,4326.49,,175440.27,,,,3255.54,604.28,,,8715.27,
2005.0,11.0,121396.5,,207172.36,14339.55,,,26346.87,304.78,434209.47,409.39,556.93,3438.11,3378.69,6089.72,1693.84


In [25]:
# Lets have a look at the repayment data
payment_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 807 entries, 0 to 806
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   SL No:    807 non-null    int64         
 1   Customer  807 non-null    object        
 2   Month     803 non-null    datetime64[ns]
 3   Amount    807 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 25.3+ KB


###### 4 rows have null in Month. Let's drop those.

In [26]:
payment_df.dropna(how="any",axis=0, inplace=True)                                       
payment_df.info()                                            # inplace=True modified payment_df directly: 803 rows remain

<class 'pandas.core.frame.DataFrame'>
Int64Index: 803 entries, 0 to 806
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   SL No:    803 non-null    int64         
 1   Customer  803 non-null    object        
 2   Month     803 non-null    datetime64[ns]
 3   Amount    803 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 31.4+ KB


###### How many unique customers are there ?

In [27]:
payment_df.Customer.nunique()

100

##### Who are the top 10 customers in terms of repayment?

In [28]:
topcust_df = payment_df.groupby('Customer')['Amount'].sum().sort_values(ascending=False)
topcust_df.head(10)

Customer
A47    4.092247e+06
A61    2.776651e+06
A58    2.711600e+06
A56    2.372900e+06
A48    2.110238e+06
A36    2.068917e+06
A44    2.003032e+06
A24    1.998198e+06
A49    1.919454e+06
A42    1.841543e+06
Name: Amount, dtype: float64

##### How does spend compare to payment?

In [29]:
#Merge customer data with transaction data 
cust_trans_df = pd.merge(customer_df , trans_df , on = 'Customer' , how = 'left')

In [30]:
cust_trans_df.sample(5)

Unnamed: 0,No,Customer,Age,City,Product,Limit,Company,Segment,Sl No:,TransDate,Type,Amount,Month,Year
1087,47,A47,46.46,CHENNAI,Platimum,1380000,C9,Normal Salary,812,2005-07-03,CAMERA,36842.343744,7.0,2005.0
802,38,A38,73.0,CHENNAI,Platimum,1500000,C20,Self Employed,629,2006-03-25,CAR,253629.378057,3.0,2006.0
983,43,A43,50.0,BANGALORE,Gold,1500000,C25,Self Employed,1373,2006-11-03,MOVIE TICKET,3133.719217,11.0,2006.0
1053,46,A46,29.0,PATNA,Silver,600000,C8,Govt,724,2004-01-25,PETRO,773.981945,1.0,2004.0
1458,64,A64,55.0,DELHI,Gold,500000,C26,Salaried_MNC,66,2006-12-03,SANDALS,3038.045898,12.0,2006.0
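
With `how='left'`, every customer is kept even if there is no matching transaction, and a customer with several transactions appears once per transaction. A minimal sketch on made-up frames (`toy_customers` and `toy_trans` are assumptions); the `indicator=True` option is used here only to make the match status visible:

```python
import pandas as pd

toy_customers = pd.DataFrame({"Customer": ["A1", "A2", "A3"],
                              "Limit": [1000, 2000, 3000]})
toy_trans = pd.DataFrame({"Customer": ["A1", "A1", "A3"],
                          "Amount": [10.0, 20.0, 30.0]})

# A1 matches twice, A2 has no transactions, A3 matches once.
merged = pd.merge(toy_customers, toy_trans, on="Customer", how="left", indicator=True)
print(merged)
```

Rows marked `left_only` in the `_merge` column are customers without any transaction; their `Amount` is NaN.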


In [31]:
# Which customers are reaching the monthly limit? 

In [32]:
monthly_spend_df = trans_df.groupby(['Customer','Year','Month'], as_index=False)['Amount'].sum()

In [33]:
monthly_spend_df = pd.merge(customer_df, monthly_spend_df, on='Customer', how='left') 

In [34]:
monthly_spend_df["SpendRatio"] = monthly_spend_df.Amount/monthly_spend_df.Limit
monthly_spend_df.query("SpendRatio > 0.9").head()

Unnamed: 0,No,Customer,Age,City,Product,Limit,Company,Segment,Year,Month,Amount,SpendRatio
4,1,A1,47.0,BANGALORE,Gold,1500000,C1,Self Employed,2005.0,2.0,1360527.0,0.907018
27,4,A4,22.0,BOMBAY,Platimum,840084,C4,Govt,2004.0,1.0,821411.0,0.977772
38,5,A5,59.0,BANGALORE,Platimum,420084,C5,Normal Salary,2006.0,3.0,393789.2,0.937406
131,13,A13,38.0,BANGALORE,Gold,500000,C13,Salaried_MNC,2006.0,11.0,462425.6,0.924851
155,15,A15,41.0,CALCUTTA,Gold,500000,C15,Govt,2005.0,9.0,493796.8,0.987594


In [35]:
monthly_spend_df.query("SpendRatio > 0.9")['Customer'].unique()

array(['A1', 'A4', 'A5', 'A13', 'A15', 'A16', 'A33', 'A48', 'A49', 'A52',
       'A53', 'A62', 'A69', 'A96', 'A99'], dtype=object)

##### Which city has the maximum spenders?

In [36]:
city_spend_df=monthly_spend_df.groupby(['City', 'Month'], as_index=False)['Amount'].sum().round(2) 

In [37]:
# reshape the data for plotting
df = city_spend_df.pivot(index='Month', columns='City', values='Amount')
df.head()

Month,BANGALORE,BOMBAY,CALCUTTA,CHENNAI,COCHIN,DELHI,PATNA,TRIVANDRUM
1.0,3126274.9,2328805.56,3244926.98,2597978.19,4617419.79,703130.98,1815232.34,814089.04
2.0,3971498.3,3008375.45,1658105.86,1969179.47,4149994.83,365936.54,751930.88,670849.37
3.0,2027004.09,2897458.18,2176908.2,1166274.69,2020514.68,1011287.4,1029935.01,881084.93
4.0,1337310.21,850195.62,1551686.8,1080791.32,2347582.32,1161682.99,148841.06,97370.19
5.0,3130804.91,778840.57,1967827.09,1924341.32,2844139.8,1072653.18,895786.43,1438594.96


In [38]:
df.plot(kind='bar')

<AxesSubplot:xlabel='Month'>

##### Which age group spends the most amount of money?

In [39]:
cust_trans_df["Age_group"] = pd.cut(x=cust_trans_df['Age'], bins=[18,30,60,100], labels=["GenX","Middle","Senior"])
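
Note that `pd.cut` builds right-closed intervals by default, so the first bin is (18, 30] and an age of exactly 18 falls outside it. Since the minimum age after cleaning is 18, such rows would get a missing `Age_group`. The sketch below, on a made-up series (`ages` is an assumption), shows the default behaviour next to `include_lowest=True`:

```python
import pandas as pd

ages = pd.Series([18.0, 25.0, 30.0, 45.0, 75.0])
labels = ["GenX", "Middle", "Senior"]

# Default: intervals are (18, 30], (30, 60], (60, 100]; 18 is not covered.
default_cut = pd.cut(ages, bins=[18, 30, 60, 100], labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(ages, bins=[18, 30, 60, 100], labels=labels, include_lowest=True)

print(pd.DataFrame({"Age": ages, "default": default_cut, "inclusive": inclusive_cut}))
```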

In [40]:
cust_trans_df.sample(10)

Unnamed: 0,No,Customer,Age,City,Product,Limit,Company,Segment,Sl No:,TransDate,Type,Amount,Month,Year,Age_group
167,11,A11,22.0,COCHIN,Gold,500000,C11,Normal Salary,253,2006-11-21,PETRO,579.885775,11.0,2006.0,GenX
102,8,A8,50.0,PATNA,Silver,600000,C8,Salaried_Pvt,8,2004-02-05,BIKE,222235.993936,2.0,2004.0,Middle
907,41,A41,35.0,COCHIN,Platimum,1500000,C23,Govt,980,2004-01-03,BUS TICKET,33033.015523,1.0,2004.0,Middle
357,19,A19,69.0,BANGALORE,Platimum,540000,C19,Salaried_Pvt,429,2004-01-25,AIR TICKET,123887.926069,1.0,2004.0,Senior
293,16,A16,24.0,COCHIN,Gold,500000,C16,Normal Salary,358,2006-07-11,AIR TICKET,404426.490726,7.0,2006.0,GenX
360,19,A19,69.0,BANGALORE,Platimum,540000,C19,Salaried_Pvt,486,2005-08-03,SHOPPING,5364.424917,8.0,2005.0,Senior
1250,54,A54,59.0,COCHIN,Platimum,1500000,C16,Normal Salary,1123,2004-09-13,PETRO,382.575146,9.0,2004.0,Middle
1103,47,A47,46.46,CHENNAI,Platimum,1380000,C9,Normal Salary,1261,2006-08-03,AIR TICKET,469804.922867,8.0,2006.0,Middle
952,42,A42,71.0,BOMBAY,Gold,1500000,C24,Normal Salary,1372,2006-10-03,RENTAL,11993.798503,10.0,2006.0,Senior
718,34,A34,28.0,CALCUTTA,Platimum,100000,C16,Salaried_Pvt,799,2005-11-22,FOOD,256.02983,11.0,2005.0,GenX


In [41]:
spend_age_df = cust_trans_df.groupby('Age_group')['Amount'].sum()
spend_age_df

Age_group
GenX      1.807039e+07
Middle    4.398701e+07
Senior    3.021539e+07
Name: Amount, dtype: float64

##### Calculate the city wise spend on each product on yearly basis. Also include a graphical representation for the same.

##### If the monthly rate of interest is 3.2%, what is the profit for the bank for each month? 
- The bank's profit is the interest earned on the Monthly Profit, where
    - Monthly Profit = Monthly repayment – Monthly spend.
- Interest is earned only on positive Monthly Profit, not on negative amounts
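
One possible way to set up this calculation is sketched below on made-up monthly totals. The frame `toy_monthly`, its column names, and its numbers are assumptions for illustration; with the real data, the monthly spend and repayment totals would first have to be aggregated from `trans_df` and `payment_df`.

```python
import pandas as pd

# Toy monthly totals; the numbers are made up for illustration.
toy_monthly = pd.DataFrame({
    "Year":      [2004, 2004, 2004],
    "Month":     [1, 2, 3],
    "Spend":     [1000.0, 1500.0, 800.0],
    "Repayment": [1200.0, 1000.0, 800.0],
})

RATE = 0.032  # monthly interest rate of 3.2% from the exercise

toy_monthly["MonthlyProfit"] = toy_monthly["Repayment"] - toy_monthly["Spend"]
# Interest is earned only on positive amounts, so negative values are clipped to zero.
toy_monthly["Interest"] = toy_monthly["MonthlyProfit"].clip(lower=0) * RATE

print(toy_monthly)
```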

#### 2. Advanced Numpy 
- The ndarray internally consists of the following:
    - A pointer to data, that is a block of system memory
    - The data type or dtype
    - A tuple indicating the array’s shape; for example, a 10 by 5 array would have shape (10, 5)
    - A tuple of strides, integers indicating the number of bytes to “step” in order to advance one element along a dimension
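
These attributes can be inspected directly. A small sketch, assuming a C-ordered array of 8-byte floats (the array `arr` is illustrative):

```python
import numpy as np

arr = np.zeros((10, 5), dtype=np.float64)

print(arr.shape)      # (10, 5)
print(arr.dtype)      # float64
print(arr.strides)    # (40, 8): 5 elements * 8 bytes to reach the next row, 8 bytes to the next column
print(arr.T.strides)  # (8, 40): the transpose reuses the same memory with swapped strides
```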

In [42]:
# numpy.concatenate takes a sequence (tuple, list, etc.) of arrays and joins them together in order along the input axis.
arr1 = np.array([[1, 2, 3], 
                 [4, 5, 6]])
arr2 = np.array([[7, 8, 9], 
                 [10, 11, 12]])

In [43]:
np.concatenate([arr1, arr2], axis=0)                    #same as np.vstack()             

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [44]:
np.concatenate([arr1, arr2], axis=1)                    #same as np.hstack()

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In [45]:
#split() splits an array 
from numpy.random import randn
bigarr = randn(10, 2)
bigarr

array([[ 0.91075977, -0.59719955],
       [-0.00330093,  2.89970458],
       [ 0.04870181,  0.1989398 ],
       [ 0.59227996, -1.15802057],
       [ 0.26087913, -0.21914888],
       [-0.57805314,  1.4443969 ],
       [-1.08288644,  2.31126752],
       [ 0.62794786, -0.92088656],
       [ 0.17650953, -0.68851977],
       [ 0.31337244, -0.39110691]])

In [46]:
one, two, three= np.split(bigarr, [3,6])               # axis = 0 , Split is [:3] , [3:6], [6:]

In [47]:
one

array([[ 0.91075977, -0.59719955],
       [-0.00330093,  2.89970458],
       [ 0.04870181,  0.1989398 ]])

In [48]:
two

array([[ 0.59227996, -1.15802057],
       [ 0.26087913, -0.21914888],
       [-0.57805314,  1.4443969 ]])

In [49]:
three

array([[-1.08288644,  2.31126752],
       [ 0.62794786, -0.92088656],
       [ 0.17650953, -0.68851977],
       [ 0.31337244, -0.39110691]])

In [50]:
#reshape an array
arr = np.arange(15)
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [51]:
new_array = arr.reshape((5,-1))                                # -1 => infer the second dimension
new_array

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [52]:
# flatten() returns a flattened 1-D copy of the array
new_array.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [53]:
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
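
`flatten()` always returns a copy, while `ravel()` returns a view whenever the memory layout allows it. The sketch below (with an illustrative array `grid`) shows the practical consequence when the flattened result is modified:

```python
import numpy as np

grid = np.arange(6).reshape((2, 3))

flat_copy = grid.flatten()               # always a copy
flat_copy[0] = 99
value_after_flatten = int(grid[0, 0])    # grid is untouched

flat_view = grid.ravel()                 # a view here, because grid is contiguous
flat_view[0] = 99
value_after_ravel = int(grid[0, 0])      # grid has changed

print(value_after_flatten, value_after_ravel)
```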

##### Broadcasting
- Broadcasting describes how arithmetic works between arrays of different shapes

In [54]:
arr = np.arange(5)
arr

array([0, 1, 2, 3, 4])

In [55]:
arr * 4                                              # scalar value 4 has been broadcast to all the elements

array([ 0,  4,  8, 12, 16])

In [56]:
centered = arr - arr.mean(0)
centered

array([-2., -1.,  0.,  1.,  2.])
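
Broadcasting also applies between arrays of different dimensions: shapes are compared from the trailing dimension, and a dimension of size 1 is stretched to match. A minimal sketch with a made-up 2 by 3 array (`data` is illustrative):

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

col_means = data.mean(axis=0)                       # shape (3,)
centered_cols = data - col_means                    # (2, 3) - (3,): broadcast across rows

row_means = data.mean(axis=1)                       # shape (2,)
centered_rows = data - row_means[:, np.newaxis]     # reshape to (2, 1) so it broadcasts across columns

print(centered_cols)
print(centered_rows)
```

Subtracting `row_means` without the extra axis would fail, since shapes (2, 3) and (2,) do not align on the trailing dimension.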

In [57]:
#sorting
arr = randn(6)
arr

array([ 0.77227829, -0.00859321, -0.09558388,  0.06130496,  0.04049233,
       -1.57105499])

In [58]:
arr.sort()                                           # sorts in place; no inplace argument is needed
arr

array([-1.57105499, -0.09558388, -0.00859321,  0.04049233,  0.06130496,
        0.77227829])

In [59]:
arr2 = randn(3, 5)
arr2

array([[-0.03145028, -0.8240869 , -0.04059456,  0.12386198,  1.68259507],
       [-1.27770235, -2.03944718, -0.18380931,  0.49422134,  0.20063713],
       [-0.13963145,  1.09158596, -0.04953955,  0.28886299,  0.192113  ]])

In [60]:
arr2[0,:].sort()                 # sort first row values in-place
arr2

array([[-0.8240869 , -0.04059456, -0.03145028,  0.12386198,  1.68259507],
       [-1.27770235, -2.03944718, -0.18380931,  0.49422134,  0.20063713],
       [-0.13963145,  1.09158596, -0.04953955,  0.28886299,  0.192113  ]])
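
For comparison, `np.sort` returns a sorted copy, `np.argsort` returns the ordering indices, and the `axis` argument sorts along one dimension. A short sketch with illustrative arrays:

```python
import numpy as np

values = np.array([3, 1, 2])

sorted_copy = np.sort(values)     # new array; values is unchanged
order = np.argsort(values)        # indices that would sort values

grid = np.array([[3, 1, 2],
                 [9, 7, 8]])
grid.sort(axis=1)                 # sort each row in place

print(sorted_copy, order)
print(grid)
```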

##### Numpy Matrix 
- NumPy has a matrix class which behaves like a 2D array and retains this structure through operations (the NumPy documentation no longer recommends it for new code; regular arrays are preferred)
- With matrices, * is matrix multiplication and ** is matrix power

In [61]:
X = np.matrix([[1, 2], [3, 4]])
X

matrix([[1, 2],
        [3, 4]])

In [62]:
#Transpose
X.T

matrix([[1, 3],
        [2, 4]])

In [63]:
#Inverse
X.I

matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

In [64]:
#Dot product 
X.I.dot(X).round(2)

array([[1., 0.],
       [0., 1.]])

In [65]:
#trace of matrix
X.trace()

matrix([[5]])

In [66]:
Y = np.matrix([[1, 1], [2, 0]])
X*Y                                         #Matrix multiplication

matrix([[ 5,  1],
        [11,  3]])

In [67]:
Y**3                                        #Matrix raised to 3 - remember Markov chain? 

matrix([[5, 3],
        [6, 2]])
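
The same operations can be written with regular arrays, which is the style the NumPy documentation currently recommends. A sketch reusing the values of `X` and `Y` from above:

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])
Y = np.array([[1, 1], [2, 0]])

product = X @ Y                          # matrix multiplication
inverse = np.linalg.inv(X)               # inverse
cube = np.linalg.matrix_power(Y, 3)      # matrix raised to the power 3
trace = np.trace(X)                      # sum of the diagonal
elementwise = X * Y                      # with arrays, * multiplies element by element

print(product)
print(cube)
```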

#### 3. Strings and Text - revisited
- There are two ways to store text data in pandas:
    - object-dtype NumPy array.
    - StringDtype extension type.
- Pandas provides a set of string functions which make it easy to operate on string data. 
- Most importantly, these functions ignore (or exclude) missing/NaN values.

In [68]:
pd.Series(["a", "b", "c"])

0    a
1    b
2    c
dtype: object

In [69]:
pd.Series(["a", "b", "c"], dtype=pd.StringDtype())

0    a
1    b
2    c
dtype: string

In [71]:
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print(s)

0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object


In [73]:
print(s.str.upper())

0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object


In [74]:
print(s.str.len())

0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64


In [76]:
# Dummy variables 
basket = pd.Series(['Milk','Bread','Eggs','Beer','Diaper'])
basket.str.get_dummies()

Unnamed: 0,Beer,Bread,Diaper,Eggs,Milk
0,0,0,0,0,1
1,0,1,0,0,0
2,0,0,0,1,0
3,1,0,0,0,0
4,0,0,1,0,0
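
`str.get_dummies` can also split each string on a separator, which helps when one entry holds several items. A minimal sketch, assuming a made-up series `multi_basket` with `|` as the separator (the default):

```python
import pandas as pd

multi_basket = pd.Series(["Milk|Bread", "Bread|Eggs", "Beer"])

# Each distinct item becomes a column; a 1 marks its presence in the row.
dummies = multi_basket.str.get_dummies(sep="|")
print(dummies)
```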


In [79]:
#Replacing a substring with another 
kwrds = pd.Series(['#Data', '#Python', '#AI'])
kwrds

0      #Data
1    #Python
2        #AI
dtype: object

In [81]:
kwrds.str.replace('#','@')

0      @Data
1    @Python
2        @AI
dtype: object

In [93]:
#splitting strings based on a character
emails = pd.Series(['sachin@gmail.com','kamal@outlook.com'])
emails.str.split('@', expand=True)

Unnamed: 0,0,1
0,sachin,gmail.com
1,kamal,outlook.com


In [82]:
df = pd.DataFrame(np.random.randn(3, 2), 
                  columns=[" Column A ", " Column B "], 
                  index=range(3))
df

Unnamed: 0,Column A,Column B
0,1.519606,-0.37466
1,-1.295619,0.832673
2,1.235362,-0.000532


In [84]:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df

Unnamed: 0,column_a,column_b
0,1.519606,-0.37466
1,-1.295619,0.832673
2,1.235362,-0.000532


In [95]:
#Concatenation
slower = pd.Series(["a", "b", "c", "d"], dtype="string")
slower.str.cat(sep="_")

'a_b_c_d'

In [99]:
supper = pd.Series(["A", "B", np.nan, "D"], dtype="string") 
supper.str.cat(slower)

0      Aa
1      Bb
2    <NA>
3      Dd
dtype: string
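
A missing value on either side makes the concatenated result missing. If a placeholder is preferred, `na_rep` can be passed; the sketch below reuses the two series from above, with `"-"` as an arbitrary placeholder:

```python
import numpy as np
import pandas as pd

slower = pd.Series(["a", "b", "c", "d"], dtype="string")
supper = pd.Series(["A", "B", np.nan, "D"], dtype="string")

# Missing entries are replaced by "-" before concatenating.
combined = supper.str.cat(slower, na_rep="-")
print(combined)
```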

In [104]:
#Checking for a pattern match

pattern = r"[0-9][a-z]"        # a digit followed by a lowercase letter

library = pd.Series(["1", "A2", "3a", "3b", "03cd", "4-d"], dtype="string")

library.str.contains(pattern)

0    False
1    False
2     True
3     True
4     True
5    False
dtype: boolean
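
`str.contains` searches anywhere in the string. When the pattern must match at the start or cover the whole string, `str.match` and `str.fullmatch` can be used instead. A sketch on a shortened version of the series above (`codes` is an illustrative name):

```python
import pandas as pd

pattern = r"[0-9][a-z]"
codes = pd.Series(["1", "A2", "3a", "03cd", "4-d"], dtype="string")

found_anywhere = codes.str.contains(pattern)    # pattern occurs somewhere in the string
found_at_start = codes.str.match(pattern)       # pattern matches at the beginning
found_whole = codes.str.fullmatch(pattern)      # pattern matches the entire string

print(pd.DataFrame({"code": codes,
                    "contains": found_anywhere,
                    "match": found_at_start,
                    "fullmatch": found_whole}))
```

Here "03cd" is found by `contains` (because of "3c") but not by `match`, since the string starts with two digits.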