# Question 1

For this question use the World Bank Data for Turkey for the following indicators. Use [wbgapi](https://pypi.org/project/wbgapi/) for getting the data.

* [Literacy rate, adult female (SE.ADT.LITR.FE.ZS)](https://data.worldbank.org/indicator/SE.ADT.LITR.FE.ZS)
* [Labor force, female (SL.TLF.TOTL.FE.ZS)](https://data.worldbank.org/indicator/SL.TLF.TOTL.FE.ZS)
* [Poverty headcount ratio at national poverty lines (SI.POV.NAHC)](https://data.worldbank.org/indicator/SI.POV.NAHC)
* [Current health expenditure per capita (SH.XPD.CHEX.PC.CD)](https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD)
* [GDP per capita (NY.GDP.PCAP.CD)](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
* [Mortality rate, under-5 (SH.DYN.MORT)](https://data.worldbank.org/indicator/SH.DYN.MORT)


Using the [statsmodels](https://www.statsmodels.org/stable/index.html) library, write the best linear regression model using child mortality as the dependent variable while the rest are considered as independent variables. Pay particular attention to the fact that the order of the variables put into the model significantly impacts the performance of the model. Choose the best model by considering

* the minimum number of variables and their interactions,
* the optimal ordering of the independent variables and their interactions,
* the $R^2$-score of the model,
* the statistical significance of the model coefficients,
* the ANOVA analysis of the model.


In [140]:
import pandas as pd
import numpy as np
import wbgapi as wb
import sklearn
from statsmodels.formula.api import ols
import statsmodels.api as sm


import yfinance as yf

from sklearn.preprocessing import OneHotEncoder
from statsmodels.formula.api import logit
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

### Getting data

In [6]:
lit_rate = wb.data.DataFrame('SE.ADT.LITR.FE.ZS')
labor_force = wb.data.DataFrame('SL.TLF.TOTL.FE.ZS')
poverty_headcount_ratio = wb.data.DataFrame('SI.POV.NAHC')
healt_expenditure = wb.data.DataFrame('SH.XPD.CHEX.PC.CD')
GDP = wb.data.DataFrame('NY.GDP.PCAP.CD')
mortality_rate = wb.data.DataFrame('SH.DYN.MORT')

### Editing and merging dataframes
The downloaded frames have the country codes in the index, so I selected the row whose index is "TUR" from each one with iloc and a boolean mask. I then transposed each selection, which gives a dataframe indexed by year, and concatenated the six indicators column-wise.

In [178]:
frames = [lit_rate, labor_force, poverty_headcount_ratio, healt_expenditure, GDP, mortality_rate]

def edit_df(frames):
    # For each indicator keep only Turkey's row, then transpose so the years become the index
    turkey = [frame.iloc[frame.index == "TUR"].T for frame in frames]
    # Place the six indicators side by side, aligned on year
    return pd.concat(turkey, axis=1)

df = edit_df(frames)
df.columns = ["lr", "lf", "ph", "he", "gdp", "mr"]
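
As a minimal sketch of the same reshaping step, assuming frames laid out like the ones above (economy codes in the index, labels such as `YR1960` in the columns), the helper below builds a per-country table and then drops the years with missing values. The function name `country_panel`, the two toy frames and all of their values are made up for illustration; they are not World Bank data. Converting the year labels to integers and calling `dropna` are additions that the notebook does not perform explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two downloaded frames (toy values, not real data)
years = ["YR2000", "YR2001", "YR2002"]
gdp_wide = pd.DataFrame([[10.0, 11.0, 12.0], [20.0, 21.0, 22.0]],
                        index=["TUR", "USA"], columns=years)
mort_wide = pd.DataFrame([[5.0, np.nan, 4.0], [1.0, 1.0, 1.0]],
                         index=["TUR", "USA"], columns=years)

def country_panel(frames, names, economy="TUR"):
    """Return one DataFrame indexed by year with one column per indicator."""
    # Selecting a single index label yields a Series indexed by the year labels
    columns = [frame.loc[economy].rename(name) for frame, name in zip(frames, names)]
    panel = pd.concat(columns, axis=1)
    # Turn labels such as "YR2000" into the integer 2000
    panel.index = panel.index.str.replace("YR", "", regex=False).astype(int)
    return panel

panel = country_panel([gdp_wide, mort_wide], ["gdp", "mr"])
complete = panel.dropna()  # keep only years where every indicator is present
print(panel)
print(len(complete), "complete rows")
```

Rows with missing values are dropped automatically when a formula model is fitted, so checking `len(complete)` beforehand shows how many observations a regression would actually use.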

In [180]:
df.head(10)

,lr,lf,ph,he,gdp,mr
YR1960,,,,,509.005545,257.0
YR1961,,,,,283.828284,249.3
YR1962,,,,,309.446624,241.4
YR1963,,,,,350.662985,233.5
YR1964,,,,,369.583469,225.7
YR1965,,,,,386.358061,218.3
YR1966,,,,,444.549483,211.3
YR1967,,,,,481.69368,204.9
YR1968,,,,,526.213475,198.8
YR1969,,,,,571.61777,192.9


### Creating a linear model with ols

In [18]:
model = ols('mr ~ lr:gdp *he +lf', data=df).fit()
model.summary()



0,1,2,3
Dep. Variable:,mr,R-squared:,0.997
Model:,OLS,Adj. R-squared:,0.995
Method:,Least Squares,F-statistic:,669.1
Date:,"Mon, 07 Nov 2022",Prob (F-statistic):,4.01e-11
Time:,17:00:16,Log-Likelihood:,-3.9785
No. Observations:,14,AIC:,17.96
Df Residuals:,9,BIC:,21.15
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,96.1328,2.968,32.393,0.000,89.419,102.846
lr:gdp,-2.476e-05,6.33e-06,-3.912,0.004,-3.91e-05,-1.04e-05
he,-0.0503,0.006,-8.243,0.000,-0.064,-0.036
lr:gdp:he,5.142e-08,9.54e-09,5.388,0.000,2.98e-08,7.3e-08
lf,-1.8571,0.138,-13.470,0.000,-2.169,-1.545

0,1,2,3
Omnibus:,6.02,Durbin-Watson:,1.942
Prob(Omnibus):,0.049,Jarque-Bera (JB):,2.959
Skew:,-1.037,Prob(JB):,0.228
Kurtosis:,3.878,Cond. No.,12500000000.0


### Best model I could find

The R-squared value of the model is 0.997, which means that the model explains 99.7 percent of the variance in the target variable. Note that the fit uses only 14 observations, since rows with a missing value in any of the model's variables are dropped. For every coefficient, both bounds of the 95% confidence interval have the same sign, so none of the intervals contains zero. This shows that the direction of each term's effect in the model is clear.
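
The headline figures of the summary can be cross-checked by hand from the sums of squares that `anova_lm` reports for this same model in the ANOVA section of this notebook. The sketch below only redoes that arithmetic with the numbers copied from the table; the variable names are illustrative.

```python
# Sums of squares copied from the ANOVA table of 'mr ~ lr:gdp*he + lf'
ss_terms = {
    "lr:gdp": 296.274074,
    "he": 47.907603,
    "lr:gdp:he": 56.956874,
    "lf": 29.171517,
}
ss_resid = 1.447074

n_obs = 14                          # No. Observations in the summary
df_model = len(ss_terms)            # 4 model terms
df_resid = n_obs - df_model - 1     # 9, after also counting the intercept

ss_model = sum(ss_terms.values())
ss_total = ss_model + ss_resid

r2 = 1 - ss_resid / ss_total
adj_r2 = 1 - (ss_resid / df_resid) / (ss_total / (n_obs - 1))
f_stat = (ss_model / df_model) / (ss_resid / df_resid)

print(round(r2, 3), round(adj_r2, 3), round(f_stat, 1))  # 0.997 0.995 669.1
```

The three values agree with the R-squared, Adj. R-squared and F-statistic printed in the summary, which confirms that the summary and the ANOVA table describe the same fit.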

### Anova analysis
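The ANOVA table shows that every term is statistically significant. The term lr:gdp has by far the largest sum of squares (296.27), which suggests that it is the most important term for estimating child mortality. Note that anova_lm reports sequential (Type I) sums of squares by default, so each term is tested after the terms listed before it and these values depend on the order of the terms in the formula.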

In [19]:
sm.stats.anova_lm(model)

,df,sum_sq,mean_sq,F,PR(>F)
lr:gdp,1.0,296.274074,296.274074,1842.661057,1.008578e-11
he,1.0,47.907603,47.907603,297.958825,3.313495e-08
lr:gdp:he,1.0,56.956874,56.956874,354.24029,1.550413e-08
lf,1.0,29.171517,29.171517,181.43072,2.861716e-07
Residual,9.0,1.447074,0.160786,,


# Question 2

For this question use Yahoo's Finance API for the following tickers:

* Gold futures (GC=F)
* Silver futures (SI=F)
* Copper futures (HG=F)
* Platinum futures (PL=F)

1. Write the best linear regression model that explains gold futures closing prices in terms of opening prices of gold, silver, copper, and platinum futures.
2. Repeat the same for silver, copper and platinum prices.
3. Compare the models you obtained in Steps 1 and 2. Which model is better? How do you decide? Explain.

###  Getting data and collecting it in a single dataframe

In [182]:
gl = yf.download('GC=F')
sl = yf.download('SI=F')
cp = yf.download('HG=F')
pl = yf.download('PL=F')


dct = {}
dct['glc'] = gl['Close']
dct['slc'] = sl['Close']
dct['cpc'] = cp['Close']
dct['plc'] = pl['Close']
dct['glo'] = gl['Open']
dct['slo'] = sl['Open']
dct['cpo'] = cp['Open']
dct['plo'] = pl['Open']
data = pd.DataFrame(dct).dropna()




###  Linear regression model that explains gold futures closing prices

In [284]:
model = ols('glc ~  glo:cpo*slo + plo ', data=data).fit()
model.summary()

0,1,2,3
Dep. Variable:,glc,R-squared:,0.974
Model:,OLS,Adj. R-squared:,0.974
Method:,Least Squares,F-statistic:,46090.0
Date:,"Mon, 07 Nov 2022",Prob (F-statistic):,0.0
Time:,21:19:00,Log-Likelihood:,-28397.0
No. Observations:,4865,AIC:,56800.0
Df Residuals:,4860,BIC:,56840.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,230.9472,4.376,52.771,0.000,222.367,239.527
glo:cpo,0.2482,0.001,167.890,0.000,0.245,0.251
slo,67.1540,0.580,115.855,0.000,66.018,68.290
glo:cpo:slo,-0.0076,6.56e-05,-115.869,0.000,-0.008,-0.007
plo,-0.4831,0.006,-84.940,0.000,-0.494,-0.472

0,1,2,3
Omnibus:,170.414,Durbin-Watson:,0.056
Prob(Omnibus):,0.0,Jarque-Bera (JB):,459.175
Skew:,0.105,Prob(JB):,1.96e-100
Kurtosis:,4.49,Cond. No.,384000.0


###  for silver

In [189]:
model = ols('slc ~  glo:plo + slo ', data=data).fit()
model.summary()

0,1,2,3
Dep. Variable:,slc,R-squared:,0.999
Model:,OLS,Adj. R-squared:,0.999
Method:,Least Squares,F-statistic:,1817000.0
Date:,"Mon, 07 Nov 2022",Prob (F-statistic):,0.0
Time:,20:39:24,Log-Likelihood:,-1304.5
No. Observations:,4865,AIC:,2615.0
Df Residuals:,4862,BIC:,2635.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0398,0.011,3.681,0.000,0.019,0.061
glo:plo,1.135e-07,2.45e-08,4.627,0.000,6.54e-08,1.62e-07
slo,0.9884,0.002,445.520,0.000,0.984,0.993

0,1,2,3
Omnibus:,3600.248,Durbin-Watson:,1.991
Prob(Omnibus):,0.0,Jarque-Bera (JB):,459338.826
Skew:,-2.711,Prob(JB):,0.0
Kurtosis:,50.293,Cond. No.,3650000.0


### for copper

In [272]:
model = ols('cpc ~ cpo*plo', data=data).fit()  # cpo*plo already expands to cpo + plo + cpo:plo
model.summary()

0,1,2,3
Dep. Variable:,cpc,R-squared:,0.999
Model:,OLS,Adj. R-squared:,0.999
Method:,Least Squares,F-statistic:,1376000.0
Date:,"Mon, 07 Nov 2022",Prob (F-statistic):,0.0
Time:,21:13:00,Log-Likelihood:,8986.4
No. Observations:,4865,AIC:,-17960.0
Df Residuals:,4861,BIC:,-17940.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0100,0.005,-2.188,0.029,-0.019,-0.001
cpo,1.0023,0.001,683.535,0.000,0.999,1.005
plo,1.765e-05,6.12e-06,2.887,0.004,5.66e-06,2.96e-05
cpo:plo,-4.455e-06,1.67e-06,-2.674,0.008,-7.72e-06,-1.19e-06

0,1,2,3
Omnibus:,892.233,Durbin-Watson:,2.142
Prob(Omnibus):,0.0,Jarque-Bera (JB):,17258.548
Skew:,0.312,Prob(JB):,0.0
Kurtosis:,12.206,Cond. No.,33800.0


### for platinum

In [274]:
model = ols('plc ~  slo+ plo:glo+ plo', data=data).fit()
model.summary()

0,1,2,3
Dep. Variable:,plc,R-squared:,0.999
Model:,OLS,Adj. R-squared:,0.999
Method:,Least Squares,F-statistic:,2777000.0
Date:,"Mon, 07 Nov 2022",Prob (F-statistic):,0.0
Time:,21:13:14,Log-Likelihood:,-17506.0
No. Observations:,4865,AIC:,35020.0
Df Residuals:,4861,BIC:,35050.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.2291,0.563,-2.181,0.029,-2.334,-0.124
slo,0.2277,0.068,3.354,0.001,0.095,0.361
plo:glo,-3.329e-06,8.9e-07,-3.738,0.000,-5.07e-06,-1.58e-06
plo,1.0019,0.001,1444.050,0.000,1.001,1.003

0,1,2,3
Omnibus:,2694.915,Durbin-Watson:,1.918
Prob(Omnibus):,0.0,Jarque-Bera (JB):,423555.696
Skew:,-1.601,Prob(JB):,0.0
Kurtosis:,48.599,Cond. No.,6780000.0


## Model comparison
All coefficients in the four models are statistically significant at the 5% level, so they can be interpreted. The highest R-squared value I found while trying to create meaningful models for the gold closing price was 0.974. For the silver, copper and platinum closing prices I got an R-squared of 0.999. Since every term of the silver model, including the intercept, has a p-value close to 0, I chose the silver model as the best one.
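
One way to make such a comparison easier to read is to collect the same few statistics from every fit into one table. The sketch below is a hedged illustration: the helper name `compare_fits` is hypothetical, the price table is synthetic (random-walk opens, closes equal to the open plus noise) so that it runs without a download, and the formulas mirror the ones fitted above. Keep in mind that R-squared and AIC computed for different dependent variables are on different scales, so the table is an aid for inspection rather than a ranking rule.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in for the price table (not real futures data)
toy = pd.DataFrame({name: 100 + np.cumsum(rng.normal(size=n))
                    for name in ["glo", "slo", "cpo", "plo"]})
for open_col, close_col in [("glo", "glc"), ("slo", "slc"),
                            ("cpo", "cpc"), ("plo", "plc")]:
    toy[close_col] = toy[open_col] + rng.normal(scale=0.5, size=n)

formulas = {
    "gold": "glc ~ glo:cpo*slo + plo",
    "silver": "slc ~ glo:plo + slo",
    "copper": "cpc ~ cpo*plo",
    "platinum": "plc ~ slo + plo:glo + plo",
}

def compare_fits(formulas, data):
    """Fit each formula with OLS and tabulate a few summary statistics."""
    rows = {}
    for label, formula in formulas.items():
        fit = ols(formula, data=data).fit()
        rows[label] = {
            "adj_r2": fit.rsquared_adj,
            "aic": fit.aic,
            "max_p": fit.pvalues.max(),   # largest coefficient p-value
            "n_terms": int(fit.df_model),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

table = compare_fits(formulas, toy)
print(table)
```

Passing the real `data` frame instead of `toy` would produce the same kind of table for the models in this notebook.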

# Question 3

1. Write a function that takes a ticker symbol and returns a pandas dataframe that for each day puts a 1 when the closing price is higher than the opening price, a 0 when the closing price is lower than the opening price.
2. Write the best logistic regression that predicts the time series you obtain from Step 1 for gold futures against the opening prices of gold, silver, copper, and platinum prices.
3. Repeat the same for silver, copper, and platinum prices.
4. Compare the models you obtained from Steps 2 and 3. Decide which is the best model, and explain your reasoning.
5. Does any of the models provide a good fit? Explain.

### Creating function


In [28]:
def profit(ticker):
    dt = yf.download(ticker)
    dt["ret"] = np.where(dt.Open < dt.Close, 1, 0)
    return dt["ret"]
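
The labelling rule can be checked on a small hand-made table. The sketch below is a variant, not the function above: the hypothetical helper `label_up_days` takes an already downloaded price frame instead of a ticker, so it runs offline, and it returns a one-column DataFrame as the question asks. Like `np.where(dt.Open < dt.Close, 1, 0)`, it labels a day with equal open and close as 0.

```python
import pandas as pd

def label_up_days(prices):
    """Return a one-column DataFrame: 1 where Close > Open, otherwise 0."""
    up = (prices["Close"] > prices["Open"]).astype(int)
    return up.to_frame("ret")

# Toy prices, made up for illustration
toy = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.0, 12.5], "Close": [10.5, 10.8, 12.0, 13.0]},
    index=pd.date_range("2022-01-03", periods=4, freq="D"),
)

labels = label_up_days(toy)
print(labels["ret"].tolist())  # [1, 0, 0, 1]
```

The third day closes exactly at its open and is labelled 0, which matters because the question only defines the labels for strictly higher and strictly lower closes.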

### Getting data from yfinance

Here I collect the opening prices that are used as predictors of profit and loss in the X dataframe.

In [286]:
dt = {}

gl = yf.download('GC=F')
sl = yf.download('SI=F')
cp = yf.download('HG=F')
pl = yf.download('PL=F')


dt["glo"] = gl["Open"]
dt['slo'] = sl['Open']
dt['cpo'] = cp['Open']
dt['plo'] = pl['Open']

X = pd.DataFrame(dt).dropna()



## logistic regression that predicts the return of gold

I assigned the profit/loss indicator to y as the dependent variable and concatenated y with the X dataframe I created before. Then I created the model.

In [328]:
y = profit("GC=F")
df = pd.concat([X,y],axis=1)

model = logit('ret ~ glo*plo +slo + cpo', data=df).fit()
model.summary()


Optimization terminated successfully.
         Current function value: 0.669845
         Iterations 5


0,1,2,3
Dep. Variable:,ret,No. Observations:,4865.0
Model:,Logit,Df Residuals:,4859.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 07 Nov 2022",Pseudo R-squ.:,0.01782
Time:,23:06:09,Log-Likelihood:,-3258.8
converged:,True,LL-Null:,-3317.9
Covariance Type:,nonrobust,LLR p-value:,7.251e-24

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.1426,0.266,-8.062,0.000,-2.663,-1.622
glo,0.0013,0.000,5.032,0.000,0.001,0.002
plo,0.0015,0.000,4.522,0.000,0.001,0.002
glo:plo,-1.97e-06,3.24e-07,-6.089,0.000,-2.6e-06,-1.34e-06
slo,0.1071,0.017,6.298,0.000,0.074,0.140
cpo,-0.1732,0.072,-2.391,0.017,-0.315,-0.031


## model for silver

In [329]:
y = profit("SI=F")
df = pd.concat([X,y],axis=1)
model = logit('ret ~ glo * plo  + slo', data=df).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.630676
         Iterations 5


0,1,2,3
Dep. Variable:,ret,No. Observations:,4865.0
Model:,Logit,Df Residuals:,4860.0
Method:,MLE,Df Model:,4.0
Date:,"Mon, 07 Nov 2022",Pseudo R-squ.:,0.01424
Time:,23:06:11,Log-Likelihood:,-3068.2
converged:,True,LL-Null:,-3112.6
Covariance Type:,nonrobust,LLR p-value:,2.554e-18

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-1.4454,0.242,-5.971,0.000,-1.920,-0.971
glo,0.0002,0.000,1.270,0.204,-0.000,0.001
plo,0.0001,0.000,0.496,0.620,-0.000,0.001
glo:plo,-6.513e-07,2.93e-07,-2.221,0.026,-1.23e-06,-7.65e-08
slo,0.0711,0.016,4.400,0.000,0.039,0.103


## model for copper

In [327]:
y = profit("HG=F")
df = pd.concat([X,y],axis=1)
model = logit('ret ~ glo: plo + cpo:slo + slo ', data=df).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.691335
         Iterations 4


0,1,2,3
Dep. Variable:,ret,No. Observations:,4865.0
Model:,Logit,Df Residuals:,4861.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 07 Nov 2022",Pseudo R-squ.:,0.002329
Time:,22:14:30,Log-Likelihood:,-3363.3
converged:,True,LL-Null:,-3371.2
Covariance Type:,nonrobust,LLR p-value:,0.001305

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.2887,0.097,-2.969,0.003,-0.479,-0.098
glo:plo,-4.975e-07,1.56e-07,-3.180,0.001,-8.04e-07,-1.91e-07
cpo:slo,-0.0078,0.003,-2.554,0.011,-0.014,-0.002
slo,0.0778,0.020,3.840,0.000,0.038,0.118


## model for platinum

In [323]:
y = profit("PL=F")
df = pd.concat([X,y],axis=1)
model = logit('ret ~ glo * cpo + plo ', data=df).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.535087
         Iterations 6


0,1,2,3
Dep. Variable:,ret,No. Observations:,4865.0
Model:,Logit,Df Residuals:,4860.0
Method:,MLE,Df Model:,4.0
Date:,"Mon, 07 Nov 2022",Pseudo R-squ.:,0.07619
Time:,22:12:27,Log-Likelihood:,-2603.2
converged:,True,LL-Null:,-2817.9
Covariance Type:,nonrobust,LLR p-value:,1.222e-91

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.8420,0.140,6.015,0.000,0.568,1.116
glo,-0.0017,0.000,-7.762,0.000,-0.002,-0.001
cpo,-0.2941,0.101,-2.922,0.003,-0.491,-0.097
glo:cpo,0.0002,6.76e-05,3.531,0.000,0.000,0.000
plo,-0.0001,0.000,-0.773,0.439,-0.000,0.000


### Which is the best model?

The model for platinum has the highest Pseudo R-squ. value (0.076, compared with 0.018 for gold, 0.014 for silver and 0.002 for copper), so we can say that it is the best model among them.

### Does any of the models provide a good fit?

We cannot say that any of the models fits well, because the Pseudo R-squ. values are all below 0.08, which is very low for a good fit.
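
The Pseudo R-squ. that statsmodels prints for a Logit model is McFadden's pseudo R-squared, which compares the log-likelihood of the fitted model with that of an intercept-only model. As a small check, the sketch below recomputes it from the Log-Likelihood and LL-Null values copied from the four summaries above; the helper name `mcfadden_r2` is illustrative.

```python
# (Log-Likelihood, LL-Null) copied from the four Logit summaries
log_likelihoods = {
    "gold": (-3258.8, -3317.9),
    "silver": (-3068.2, -3112.6),
    "copper": (-3363.3, -3371.2),
    "platinum": (-2603.2, -2817.9),
}

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept only)."""
    return 1.0 - ll_model / ll_null

pseudo_r2 = {name: mcfadden_r2(*pair) for name, pair in log_likelihoods.items()}
for name, value in pseudo_r2.items():
    print(f"{name}: {value:.4f}")
```

The results match the reported values up to the rounding of the printed log-likelihoods. A value near 0 means that the predictors barely improve on a model that always predicts the overall share of up days.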

# Question 4

For this question use the following [data](https://archive.ics.uci.edu/ml/datasets/credit+approval):


In [331]:
credit = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', header=None)

fn = {'+': 1, '-': 0}

X = credit.replace('?',0).iloc[:,[1,2,7,10,14]]
y = credit.iloc[:,15].map(lambda x: fn.get(x,0))

1. Split the data into training and test set.
2. Write different logistic regression models predicting y against X.
3. Construct [confusion matrices](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) on the test data set for these different models.
4. Analyze these models. Explain which model is the best model you have found.
5. Repeat Steps 1-4 several times. Does your best model stay as the best model? What should be the correct protocol to decide on the best model explaining the data?

## Editing data

Since the column names of the X dataframe are numbers, I renamed the columns with letters. Then I converted column a to float, because it is stored as type object.

In [332]:
X.columns = ["a","b","c","d","f"]
X['a'] = X['a'].astype(float)
y = y.rename("y")  # keep y one-dimensional, which is the shape scikit-learn expects for a target

## Train test split and creating models

In [377]:
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75)

model1 = LogisticRegression(max_iter=1000, C=50)
model1.fit(X_train,y_train)

y1_predict = model1.predict(X_test)
confusion_matrix(y_test,y1_predict)




array([[94,  8],
       [28, 43]], dtype=int64)

In [391]:

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75)

X_train["a*f"] = X_train["a"]* X_train["f"]
X_test["a*f"] = X_test["a"]* X_test["f"]



model2 = LogisticRegression(max_iter=1500)
model2.fit(X_train,y_train)

y2_predict = model2.predict(X_test)
confusion_matrix(y_test,y2_predict)



array([[82, 13],
       [31, 47]], dtype=int64)

In [368]:
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75)

X_train["c*d"] = X_train["c"]* X_train["d"]
X_test["c*d"] = X_test["c"]* X_test["d"]


model3 = LogisticRegression(max_iter=1000)
model3.fit(X_train,y_train)

y3_predict = model3.predict(X_test)
confusion_matrix(y_test,y3_predict)



array([[83, 13],
       [28, 49]], dtype=int64)

### Evaluating models

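Looking at the confusion matrices, model1 has the most correct predictions (true negatives plus true positives): 94 + 43 = 137 of 173 test samples, compared with 129 for model2 and 132 for model3. So I chose model1 as the best model. Note that each model was evaluated on a different random split, so part of the difference comes from the split itself. Depending on the definition of the problem, a different metric (for example recall or precision) could lead to a different choice.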
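
To make the comparison concrete, the sketch below derives accuracy, precision and recall from the three confusion matrices printed above. It relies on scikit-learn's layout, where rows are the true classes and columns are the predicted classes, so a binary matrix reads `[[TN, FP], [FN, TP]]`. The helper name `summarize` is illustrative.

```python
import numpy as np

# Confusion matrices copied from the three cells above
matrices = {
    "model1": np.array([[94, 8], [28, 43]]),
    "model2": np.array([[82, 13], [31, 47]]),
    "model3": np.array([[83, 13], [28, 49]]),
}

def summarize(cm):
    """Accuracy, precision and recall of the positive class."""
    tn, fp, fn, tp = cm.ravel()
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

metrics = {name: summarize(cm) for name, cm in matrices.items()}
for name, values in metrics.items():
    print(name, {key: round(float(val), 3) for key, val in values.items()})
```

On these matrices model1 has the highest accuracy (about 0.792) and precision (about 0.843), while model3 has the highest recall (about 0.636), so the choice of the best model depends on the metric.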

### Repeat steps for model1

In [386]:
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75)

model1 = LogisticRegression(max_iter=1000, C=50)
model1.fit(X_train,y_train)

y1_predict = model1.predict(X_test)
confusion_matrix(y_test,y1_predict)




array([[83,  7],
       [30, 53]], dtype=int64)

> I did not get similar results when I ran the model repeatedly, because each call to train_test_split draws a new random split. We can use the GridSearchCV method to tune the hyperparameters with cross-validation. For a better evaluation we can use the bootstrap.
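
One possible protocol, sketched below, is to score every candidate model on the same set of repeated, stratified cross-validation splits and to compare the mean and spread of the scores instead of a single confusion matrix. This swaps in repeated k-fold cross-validation for the bootstrap mentioned above. The data come from `make_classification` as a synthetic stand-in for the credit data, and the two candidates (differing only in `C`) as well as the added scaling step are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (five numeric features, binary target)
X_toy, y_toy = make_classification(n_samples=600, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)

# A fixed random_state makes every candidate see exactly the same 50 splits
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

candidates = {
    "C=50": make_pipeline(StandardScaler(), LogisticRegression(C=50, max_iter=1000)),
    "C=1": make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)),
}

scores = {name: cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
          for name, model in candidates.items()}

for name, values in scores.items():
    print(f"{name}: mean={values.mean():.3f} std={values.std():.3f}")
```

If the difference between two means is small compared with the standard deviations, the data do not clearly favour either model, which is consistent with the ranking changing from one random split to the next.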