### Phase 2 Linear Regression
#### In action

Using previously implemented functions of linear regression.

In [87]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.metrics import r2_score
from datetime import datetime
from pandas import Timestamp
import matplotlib.pyplot as plt
from collections import Counter

In [11]:
def add_bias_column(X):
    """
    Args:a
        X (array): can be either 1-d or 2-d
    
    Returns:
        Xnew (array): the same array, but 2-d with a column of 1's in the first spot
    """
    
    # If the array is 1-d
    if len(X.shape) == 1:
        Xnew = np.column_stack([np.ones(X.shape[0]), X])
    
    # If the array is 2-d
    elif len(X.shape) == 2:
        bias_col = np.ones((X.shape[0], 1))
        Xnew = np.hstack([bias_col, X])
        
    else:
        raise ValueError("Input array must be either 1-d or 2-d")

    return Xnew

In [12]:
def line_of_best_fit(X, y):
    """
    Description: 
        Determines the coefficients for the line of best fit in y = mx + b form.
    
    Args:
        X (array): either 1D or 2D, contains predictor values
        y (array): 1D, contains dependent values
    
    Returns:
        coefs (array): 1D, has the coefficients of the line of best fit in the form [m, b]
    """
    X = add_bias_column(X)
    return np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T, X)), X.T), y)

In [13]:
def linreg_predict(Xnew, ynew, m):
    """
    Description:
        Determines the validity of output of applying provided values to the linear regression model.
    
    Args:
        Xnew (array): 1D or 2D, includes all p predictor features without bias term
        ynew (array): 1D, includes all corresponding values to Xnew
        m (array): 1D, len = p + 1, contains coefficients from 'line_of_best_fit' function
    
    Returns: (in key value pairs in dictionary form)
        ypreds (array): predicted values of m --> Xnew
        resids (array): differences between ynew and ypreds (in essence, ynew - ypreds)
        mse(float): mean squared error
        r2(float): coefficient of determination representing the amount of variability in 
                   ynew that is explained by the LOBF
    """
    # initialize results and other data structures
    results = {} 
                
    # calculations
    results['ypreds'] = np.dot(add_bias_column(Xnew), m)
    results['resids'] = ynew - results['ypreds']
    results['mse'] = (results['resids']**2).mean()
    results['r2'] = sklearn.metrics.r2_score(ynew, results['ypreds'])
    
    return results

Importing in the data for the regression.

In [75]:
df = pd.read_csv('Data News Sources.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,date,sentiment,text,source_country,queried_country,url,Safety Index
0,0,2019-06-12 18:48:59,-0.308,It’s time we start talking about climate chang...,mx,Russia,https://www.amnesty.org/en/latest/news/2019/06...,0.585
1,1,2019-07-12 17:14:00,-0.108,"Even now, as more frequent ""king tides"" bubble...",us,China,https://edition.cnn.com/2019/07/11/us/miami-li...,0.784286
2,2,2019-10-23 15:32:04,0.292,The second meeting of the Board of senior memb...,uz,Russia,http://www.uzreport.com/sco-interbank-associat...,0.585
3,3,2019-10-23 15:34:13,0.398,The Shanghai Cooperation Organization establis...,uz,Russia,http://www.uzreport.com/entrepreneur-committee...,0.585
4,4,2019-10-23 15:37:39,0.146,All participants of the exhibition “Tea and Co...,uz,Russia,http://www.uzreport.com/over-7000-people-visit...,0.585


Defining the variables to be used for linear regression. In this case we are seeing how much time effects both the safety index and the sentiment of a given country. This analysis will be explored by determining the relationship with some brief visual effects and linear regression. This is begun by scaling each quantitiative variable.

In [104]:
df['sentiment'] = (df['sentiment'] - df['sentiment'].mean()) / df['sentiment'].mean()
df['Safety Index'] = (df['Safety Index'] - df['Safety Index'].mean()) / df['Safety Index'].mean()

# for row in df['date']:
#     datetime.timestamp(datetime.strptime(row, '%Y-%m-%d %H:%M:%S'))
    
df.head()

Unnamed: 0.1,Unnamed: 0,date,sentiment,text,source_country,queried_country,url,Safety Index
0,0,2019-06-12 18:48:59,1.702261e+16,It’s time we start talking about climate chang...,mx,Russia,https://www.amnesty.org/en/latest/news/2019/06...,-177297300000000.0
1,1,2019-07-12 17:14:00,7365913000000000.0,"Even now, as more frequent ""king tides"" bubble...",us,China,https://edition.cnn.com/2019/07/11/us/miami-li...,354594600000000.0
2,2,2019-10-23 15:32:04,-1.194748e+16,The second meeting of the Board of senior memb...,uz,Russia,http://www.uzreport.com/sco-interbank-associat...,-177297300000000.0
3,3,2019-10-23 15:34:13,-1.706553e+16,The Shanghai Cooperation Organization establis...,uz,Russia,http://www.uzreport.com/entrepreneur-committee...,-177297300000000.0
4,4,2019-10-23 15:37:39,-4898091000000000.0,All participants of the exhibition “Tea and Co...,uz,Russia,http://www.uzreport.com/over-7000-people-visit...,-177297300000000.0


In [108]:
df_Rus = df.loc[df['queried_country'] == 'Russia'] # grab all rows of data from Russia
X = df_Rus['sentiment']
y = df_Rus['Safety Index']

m = line_of_best_fit(X, y)
print(m)

results = linreg_predict(df['sentiment'], df['Safety Index'], m)

[-1.77297269e+14 -2.16840434e-19]
