# Statistical Data Analysis

__Q] Are there variables that are particularly significant in terms of explaining the answer to your project question?__

__A]__ _This project aims to identify fraudulent transactions. On initial observation, the variables that appear most relevant for this purpose are card1-card6, C1-C14, TransactionDT and D1-D5. These variables seem to play a significant role in determining whether isFraud is 1 (fraud) or 0 (non-fraud). However, no direct correlation or linear relationship involving these variables is apparent yet. Hence, further investigation is needed to find the actual relationships and establish their significance._
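One way to check for a linear relationship is the Pearson correlation of each candidate column with `isFraud`. The sketch below uses a small made-up frame rather than the project data; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the real project uses the merged transaction/identity frame.
toy = pd.DataFrame({
    "card1":   [1000, 2000, 3000, 4000, 5000, 6000],
    "C1":      [1, 1, 2, 1, 3, 1],
    "isFraud": [0, 0, 1, 0, 1, 0],
})

# Pearson correlation of every feature with the binary target.
# With a 0/1 target this is the point-biserial correlation.
corr_with_target = toy.drop(columns="isFraud").corrwith(toy["isFraud"])
print(corr_with_target)
```

Values close to 0 indicate little linear association; they do not rule out a non-linear one.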

__Q] Are there significant differences between subgroups in your data that may be relevant to your project aim?__

__A]__ _Based on initial exploration, no significant differences were found between subgroups of the data that are directly relevant to the project's aim._
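A simple way to look for subgroup differences is to compare the fraud rate per group. The following is a minimal sketch on invented rows, assuming `ProductCD` as the grouping column; it is not the exploration that was actually run.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged dataset.
toy = pd.DataFrame({
    "ProductCD": ["W", "W", "W", "W", "C", "C", "H", "H"],
    "isFraud":   [0,   0,   0,   1,   1,   1,   0,   0],
})

# Fraud rate and row count per subgroup; large gaps between rates
# (on groups with enough rows) would suggest a relevant subgroup effect.
fraud_by_group = toy.groupby("ProductCD")["isFraud"].agg(["mean", "count"])
print(fraud_by_group)
```

The `count` column matters as much as the `mean`: a rate computed on a handful of rows is not reliable evidence of a difference.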

__Q] Are there strong correlations between pairs of independent variables or between an independent and a dependent variable?__

__A]__ _Below, we compute the coefficient of determination (R-squared) to look for such correlations. In this project, the dependent variable is 'isFraud', since we are interested in how it is affected by individual independent variables or by combinations of them._

_Previous data exploration showed no linear correlations between the variables. Here we therefore check what proportion of the variance of one variable is predictable from the others, which is what the coefficient of determination measures._

_The coefficient of determination is the ratio of the explained variation to the total variation. For a least-squares fit with an intercept it satisfies $0 \le r^2 \le 1$, and it denotes the strength of the linear association between x and y. It represents the proportion of the variance in y that is explained by the line of best fit._
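The ratio described above can be computed by hand as $1 - SS_{res}/SS_{tot}$. The sketch below does this with NumPy on synthetic data; the data, seed and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# Least-squares line with an intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
r2 = 1.0 - ss_res / ss_tot

print(r2)
```

For a single predictor this value equals the squared Pearson correlation between x and y.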

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

In [2]:
df_identity = pd.read_csv('train_identity.csv')
df_transaction = pd.read_csv('train_transaction.csv')

In [3]:
# The columns below hold a small fixed set of string values: 'Found' and 'NotFound',
# plus 'New' for id_15 and id_28. Hence, converting these columns to the 'category' dtype.
# Any value not listed in `categories` becomes NaN.
df_identity['id_12'] = pd.Categorical(values = df_identity['id_12'], categories = ['Found','NotFound'])
df_identity['id_15'] = pd.Categorical(values = df_identity['id_15'], categories = ['Found','NotFound','New'])
df_identity['id_16'] = pd.Categorical(values = df_identity['id_16'], categories = ['Found','NotFound'])
df_identity['id_27'] = pd.Categorical(values = df_identity['id_27'], categories = ['Found','NotFound'])
df_identity['id_28'] = pd.Categorical(values = df_identity['id_28'], categories = ['Found','NotFound','New'])
df_identity['id_29'] = pd.Categorical(values = df_identity['id_29'], categories = ['Found','NotFound'])

# For the columns below, mapping the string values 'T' and 'F' to
# 1 and 0 respectively.
mapValues = {'T': 1, 'F': 0}
df_identity['id_35'] = df_identity['id_35'].map(mapValues)
df_identity['id_36'] = df_identity['id_36'].map(mapValues)
df_identity['id_37'] = df_identity['id_37'].map(mapValues)
df_identity['id_38'] = df_identity['id_38'].map(mapValues)

# Most columns are NaN wherever 'id_35' is NaN, so those rows could be dropped from the
# dataset (this would remove only 2.25% of total data). The filter is currently disabled.
#df_identity = df_identity[df_identity['id_35'].notnull()]

# Setting 'DeviceType' (including its NaN values) to 'desktop' for the 'DeviceInfo' values below because,
# as per analysis of 'DeviceInfo' and its combination with column 'id_31', the device type is 'desktop'.
# A single .loc call is used so the assignment acts on df_identity itself, not on a temporary copy.
df_identity.loc[df_identity['DeviceInfo'].isin(['Windows','rv:11.0','Trident/7.0']), 'DeviceType'] = 'desktop'

####################################################################################

# Adding new column TransactionDay by calculating the value from the TransactionDT column
# (the division assumes TransactionDT is expressed in seconds).
# np.ceil returns the smallest integer greater than or equal to the provided value.
df_transaction['TransactionDay']=np.ceil(df_transaction['TransactionDT']/60/60/24).astype('int')

# Replacing NaN values of card4 with the card4 value seen on other rows that share the same card1
card4_by_card1 = (df_transaction.loc[df_transaction['card4'].notnull(), ['card1', 'card4']]
                  .sort_values('card1').drop_duplicates().set_index('card1')['card4'])
card4_missing = df_transaction['card4'].isnull()
df_transaction.loc[card4_missing, 'card4'] = df_transaction.loc[card4_missing, 'card1'].map(card4_by_card1)

# Folding the rare card6 values into the two main categories.
# Assigning the result back avoids relying on in-place modification of a column accessor.
df_transaction['card6'] = df_transaction['card6'].replace({'debit or credit': 'debit', 'charge card': 'credit'})

# For the columns below, mapping the string values 'T' and 'F' to
# 1 and 0 respectively.
df_transaction['M1'] = df_transaction['M1'].map(mapValues)
df_transaction['M2'] = df_transaction['M2'].map(mapValues)
df_transaction['M3'] = df_transaction['M3'].map(mapValues)
df_transaction['M5'] = df_transaction['M5'].map(mapValues)
df_transaction['M6'] = df_transaction['M6'].map(mapValues)
df_transaction['M7'] = df_transaction['M7'].map(mapValues)
df_transaction['M8'] = df_transaction['M8'].map(mapValues)
df_transaction['M9'] = df_transaction['M9'].map(mapValues)

In [4]:
# Creating main dataset by merging both the DataFrames created from two csv files
dataset = pd.merge(df_transaction,df_identity,on='TransactionID',how='left')
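A left merge keeps every transaction and fills identity columns with NaN where no identity row exists. The sketch below, on two invented miniature tables, shows how `validate` and `indicator` could be used to check that behaviour; the frames `transactions` and `identity` are assumptions, and it presumes `TransactionID` is unique in both files.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
transactions = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0, 1, 0]})
identity = pd.DataFrame({"TransactionID": [2, 3], "DeviceType": ["mobile", "desktop"]})

merged = pd.merge(
    transactions, identity,
    on="TransactionID", how="left",
    validate="one_to_one",   # raises MergeError if TransactionID repeats on either side
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# A left merge on a unique key keeps exactly one row per transaction.
print(len(merged) == len(transactions))
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are transactions without identity information, which is where the NaN identity columns come from.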

In [5]:
dataset.shape

(590540, 435)

In [6]:
dataset.head()

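| | TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | ... | id_31 | id_32 | id_33 | id_34 | id_35 | id_36 | id_37 | id_38 | DeviceType | DeviceInfo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | NaN | 150.0 | discover | 142.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404.0 | 150.0 | mastercard | 102.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490.0 | 150.0 | visa | 166.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2987003 | 0 | 86499 | 50.0 | W | 18132 | 567.0 | 150.0 | mastercard | 117.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2987004 | 0 | 86506 | 50.0 | H | 4497 | 514.0 | 150.0 | mastercard | 102.0 | ... | samsung browser 6.2 | 32.0 | 2220x1080 | match_status:2 | 1.0 | 0.0 | 1.0 | 1.0 | mobile | SAMSUNG SM-G892A Build/NRD90M |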


In [7]:
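def coeff_of_determination(X,Y):
    """Fit a linear regression of Y on X and return its R-squared on the same data."""
    # Model Initialization
    reg = LinearRegression()
    # Data Fitting
    reg = reg.fit(X, Y)

    # Model Evaluation: score() returns the coefficient of determination
    r2 = reg.score(X, Y)
    return r2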

In [8]:
X = np.array([dataset.card1, dataset.C1, dataset.TransactionDay]).T
Y = np.array(dataset.isFraud)
coeff_of_determination(X,Y)

0.0013461555872830155

In [9]:
X = np.array([dataset.card1, dataset.C1, dataset.C2, dataset.C3, dataset.C4]).T
Y = np.array(dataset.isFraud)
coeff_of_determination(X,Y)

0.006447940482810721

In [10]:
X = np.array([dataset.card1, dataset.C1,dataset.D1.fillna(999999)]).T
Y = np.array(dataset.isFraud)
coeff_of_determination(X,Y)

0.0011253907612214231

In [11]:
X = np.array([dataset.card1, dataset.C3]).T
Y = np.array(dataset.isFraud)
coeff_of_determination(X,Y)

0.00023362414451610913

In [12]:
mapValuesCD = {'W': 1, 'H': 2, 'C': 3, 'S': 4, 'R': 5}
dataset['ProductCD'] = dataset['ProductCD'].map(mapValuesCD)
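The integer codes above impose an ordering (W < H < C < S < R) that a linear model treats as numeric distance. A possible alternative, shown only as a sketch on made-up rows, is one-hot encoding with `pd.get_dummies`; the `toy` frame is an assumption, and this is not the encoding used in the cells that follow.

```python
import pandas as pd

# Hypothetical rows covering the five product codes.
toy = pd.DataFrame({"ProductCD": ["W", "H", "C", "S", "R", "W"]})

# One indicator column per product code; no ordering is implied.
one_hot = pd.get_dummies(toy["ProductCD"], prefix="ProductCD")
print(one_hot.columns.tolist())
```

When such indicators are fed to a regression with an intercept, one of them is usually dropped to avoid perfect collinearity.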

In [13]:
X = np.array([dataset.card1,dataset.C1,dataset.ProductCD]).T
Y = np.array(dataset.isFraud)
coeff_of_determination(X,Y)

0.011202165401961173

In [14]:
X = np.array([dataset.card1,dataset.C1,dataset.V1.fillna(999999)]).T
Y = np.array(dataset.isFraud)
coeff_of_determination(X,Y)

0.008722178048638174

In [15]:
# Using the Ordinary Least Squares method to get R-squared
reg = smf.ols(formula = "isFraud ~ card1+C1+C2+C3+C4+C5+C6", data = dataset).fit()
reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | isFraud | R-squared: | 0.008 |
| Model: | OLS | Adj. R-squared: | 0.008 |
| Method: | Least Squares | F-statistic: | 654.7 |
| Date: | Wed, 18 Dec 2019 | Prob (F-statistic): | 0.0 |
| Time: | 02:34:45 | Log-Likelihood: | 164810.0 |
| No. Observations: | 590540 | AIC: | -329600.0 |
| Df Residuals: | 590532 | BIC: | -329500.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0405 | 0.001 | 75.012 | 0.000 | 0.039 | 0.042 |
| card1 | -4.996e-07 | 4.86e-08 | -10.279 | 0.000 | -5.95e-07 | -4.04e-07 |
| C1 | -0.0006 | 2.2e-05 | -25.619 | 0.000 | -0.001 | -0.001 |
| C2 | 0.0008 | 1.73e-05 | 47.433 | 0.000 | 0.001 | 0.001 |
| C3 | -0.0085 | 0.002 | -5.388 | 0.000 | -0.012 | -0.005 |
| C4 | -0.0004 | 3.07e-05 | -11.754 | 0.000 | -0.000 | -0.000 |
| C5 | -0.0002 | 2.14e-05 | -8.483 | 0.000 | -0.000 | -0.000 |
| C6 | -0.0003 | 3.03e-05 | -9.610 | 0.000 | -0.000 | -0.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 567659.541 | Durbin-Watson: | 1.915 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 15998469.518 |
| Skew: | 5.016 | Prob(JB): | 0.0 |
| Kurtosis: | 26.442 | Cond. No. | 73400.0 |
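The "Adj. R-squared" entry in the summary above corrects R-squared for the number of predictors. As a rough check, the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ can be evaluated with the rounded values from the summary (n = 590540 observations, p = 7 predictors). The function name below is hypothetical.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Rounded values read off the summary, so the result is approximate.
adj = adjusted_r2(r2=0.008, n_obs=590540, n_predictors=7)
print(round(adj, 3))
```

With this many observations the adjustment is negligible, which is why both entries round to the same value.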


In [16]:
# Using the Ordinary Least Squares method to get R-squared
reg = smf.ols(formula = "isFraud ~ card1", data = dataset).fit()
reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | isFraud | R-squared: | 0.0 |
| Model: | OLS | Adj. R-squared: | 0.0 |
| Method: | Least Squares | F-statistic: | 109.9 |
| Date: | Wed, 18 Dec 2019 | Prob (F-statistic): | 1.04e-25 |
| Time: | 02:34:47 | Log-Likelihood: | 162580.0 |
| No. Observations: | 590540 | AIC: | -325200.0 |
| Df Residuals: | 590538 | BIC: | -325100.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0401 | 0.001 | 74.329 | 0.000 | 0.039 | 0.041 |
| card1 | -5.114e-07 | 4.88e-08 | -10.483 | 0.000 | -6.07e-07 | -4.16e-07 |

| | | | |
|---|---|---|---|
| Omnibus: | 570877.668 | Durbin-Watson: | 1.915 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 16232378.519 |
| Skew: | 5.06 | Prob(JB): | 0.0 |
| Kurtosis: | 26.607 | Cond. No. | 24900.0 |
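The `t` column in the coefficient table is the coefficient divided by its standard error. Using the rounded `card1` values printed above gives approximately the reported statistic of -10.483; the small gap comes from rounding in the displayed numbers.

```python
# Values copied from the card1 row of the summary (already rounded for display).
coef = -5.114e-07
std_err = 4.88e-08

t_stat = coef / std_err
print(round(t_stat, 2))
```

A large |t| alongside an R-squared near zero is typical of very large samples: the slope is estimated precisely even though it explains almost none of the variance in isFraud.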


In [17]:
# Using the Ordinary Least Squares method to get R-squared
reg = smf.ols(formula = "isFraud ~ C1", data = dataset).fit()
reg.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | isFraud | R-squared: | 0.001 |
| Model: | OLS | Adj. R-squared: | 0.001 |
| Method: | Least Squares | F-statistic: | 552.4 |
| Date: | Wed, 18 Dec 2019 | Prob (F-statistic): | 4.35e-122 |
| Time: | 02:34:53 | Log-Likelihood: | 162800.0 |
| No. Observations: | 590540 | AIC: | -325600.0 |
| Df Residuals: | 590538 | BIC: | -325600.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0344 | 0.000 | 143.123 | 0.000 | 0.034 | 0.035 |
| C1 | 4.206e-05 | 1.79e-06 | 23.503 | 0.000 | 3.85e-05 | 4.56e-05 |

| | | | |
|---|---|---|---|
| Omnibus: | 570488.355 | Durbin-Watson: | 1.915 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 16200026.346 |
| Skew: | 5.055 | Prob(JB): | 0.0 |
| Kurtosis: | 26.584 | Cond. No. | 135.0 |


__Q] What are the most appropriate tests to use to analyze these relationships?__

__A]__ 