In this journal we will attempt our first prediction models.
As always, let's import our libraries.

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.stats.proportion import proportions_ztest
import mpl_toolkits.mplot3d as m3d
from scipy import special
import os

Now let's load the dataset we created at the end of our previous journal.

In [27]:
df = pd.read_csv('/Users/lokikeeler/Downloads/train-balanced-sarcasm_2.csv')
df.head()

Unnamed: 0,label,comment,score,ups,downs,date,created_utc,parent_comment,year,SUB_2007scape,...,SUB_television,SUB_tf2,SUB_todayilearned,SUB_trees,SUB_ukpolitics,SUB_unitedkingdom,SUB_videos,SUB_worldnews,SUB_wow,SUB_xboxone
0,0,NC and NH.,2,-1,-1,2016-10-01,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ...",2016,False,...,False,False,False,False,False,False,False,False,False,False
1,0,You do know west teams play against west teams...,-4,-1,-1,2016-11-01,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...,2016,False,...,False,False,False,False,False,False,False,False,False,False
2,0,"They were underdogs earlier today, but since G...",3,3,0,2016-09-01,2016-09-22 21:45:37,They're favored to win.,2016,False,...,False,False,False,False,False,False,False,False,False,False
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,2016-10-01,2016-10-18 21:03:47,deadass don't kill my buzz,2016,False,...,False,False,False,False,False,False,False,False,False,False
4,0,"I don't pay attention to her, but as long as s...",0,0,0,2016-09-01,2016-09-02 10:35:08,do you find ariana grande sexy ?,2016,False,...,False,False,False,False,False,False,False,False,False,False


For our first logistic regression model we need to edit our columns so that we exclude object data types and avoid collinearity. The comment and parent_comment columns are object data types. We also already have year, so we don't need date and created_utc.

In [28]:
df_2 = df.drop(['comment', 'date', 'created_utc', 'parent_comment'], axis=1)
df_2.head()

Unnamed: 0,label,score,ups,downs,year,SUB_2007scape,SUB_AdviceAnimals,SUB_Android,SUB_AskMen,SUB_AskReddit,...,SUB_television,SUB_tf2,SUB_todayilearned,SUB_trees,SUB_ukpolitics,SUB_unitedkingdom,SUB_videos,SUB_worldnews,SUB_wow,SUB_xboxone
0,0,2,-1,-1,2016,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,0,-4,-1,-1,2016,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,0,3,3,0,2016,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,0,-8,-1,-1,2016,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,0,0,0,0,2016,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
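The columns above were dropped by name. As a minimal sketch, assuming a small made-up DataFrame (the `toy` frame and its `SUB_example` column are hypothetical, not part of our dataset), non-numeric columns can also be located programmatically before dropping them:

```python
import pandas as pd

# Hypothetical toy frame that mimics the mix of column types in our data.
toy = pd.DataFrame({
    "label": [0, 1],
    "comment": ["hello", "sure, great idea"],
    "score": [2, -4],
    "SUB_example": [False, True],
})

# Columns that are neither numeric nor boolean cannot go into the model as-is.
non_numeric = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Drop them to get a frame that can later be cast to float.
numeric_ready = toy.drop(columns=non_numeric)
print(non_numeric)
print(numeric_ready.dtypes)
```

This only finds columns by type. Columns such as date and created_utc, which we dropped because they overlap with year, still have to be chosen by judgment.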


Great. Now let's pull up some initial correlation values.

In [29]:
df_2.corr().style.background_gradient()

Unnamed: 0,label,score,ups,downs,year,SUB_2007scape,SUB_AdviceAnimals,SUB_Android,SUB_AskMen,SUB_AskReddit,SUB_BigBrother,SUB_Bitcoin,SUB_BlackPeopleTwitter,SUB_CFB,SUB_CODZombies,SUB_Christianity,SUB_CoDCompetitive,SUB_Conservative,SUB_CringeAnarchy,SUB_DestinyTheGame,SUB_DotA2,SUB_Eve,SUB_Futurology,SUB_GlobalOffensive,SUB_KotakuInAction,SUB_Libertarian,SUB_MMA,SUB_MensRights,SUB_Monstercat,SUB_NoMansSkyTheGame,SUB_Overwatch,SUB_PS4,SUB_Planetside,SUB_Rainbow6,SUB_SandersForPresident,SUB_Showerthoughts,SUB_Smite,SUB_SquaredCircle,SUB_SubredditDrama,SUB_The_Donald,SUB_TrollXChromosomes,SUB_TumblrInAction,SUB_TwoXChromosomes,SUB_WTF,SUB_anime,SUB_apple,SUB_atheism,SUB_australia,SUB_aww,SUB_baseball,SUB_bravefrontier,SUB_canada,SUB_cars,SUB_childfree,SUB_conspiracy,SUB_creepyPMs,SUB_cringepics,SUB_europe,SUB_exmormon,SUB_explainlikeimfive,SUB_fantasyfootball,SUB_fatlogic,SUB_fivenightsatfreddys,SUB_formula1,SUB_funny,SUB_gaming,SUB_gifs,SUB_guns,SUB_hearthstone,SUB_heroesofthestorm,SUB_hillaryclinton,SUB_hiphopheads,SUB_hockey,SUB_india,SUB_leagueoflegends,SUB_magicTCG,SUB_mildlyinteresting,SUB_movies,SUB_nba,SUB_news,SUB_nfl,SUB_nottheonion,SUB_pathofexile,SUB_pcmasterrace,SUB_pics,SUB_pokemon,SUB_pokemongo,SUB_politics,SUB_relationships,SUB_runescape,SUB_smashbros,SUB_soccer,SUB_starcraft,SUB_technology,SUB_teenagers,SUB_television,SUB_tf2,SUB_todayilearned,SUB_trees,SUB_ukpolitics,SUB_unitedkingdom,SUB_videos,SUB_worldnews,SUB_wow,SUB_xboxone
label,1.0,-0.016425,0.00055,0.054518,-0.049559,-0.001008,0.011097,-0.003764,-0.009118,-0.086663,-0.002955,0.001859,-0.01406,-0.002775,-0.001397,0.003769,0.003964,0.01297,-0.005401,-0.00019,-0.005769,-0.010992,0.004075,0.007625,0.0059,0.015325,-0.000544,0.022011,0.001416,0.001532,-0.011001,-0.004402,0.003434,0.001614,0.002772,-0.010023,0.00077,-0.004277,0.009513,-0.016248,-0.000669,0.017805,0.01043,-0.016257,-0.013482,0.001017,0.025696,0.012113,-0.026122,-0.003839,-0.00079,0.010848,-0.003785,0.006978,0.009635,0.047167,0.002023,0.011381,0.003503,-0.004761,-0.018283,0.011663,0.000744,0.002463,-0.024014,-0.009557,-0.020254,-0.000832,-5.8e-05,0.000244,0.0028,-0.010086,0.001037,0.003214,0.007666,-0.001678,-0.013024,-0.007134,0.001157,0.026698,-0.004279,0.009973,-0.004366,0.012173,-0.011739,-0.01215,-0.003586,0.042832,0.001361,-0.00153,0.004049,0.006325,0.003652,0.014999,-0.006463,-0.003393,-0.00023,0.007853,-0.012253,0.00568,0.005961,-0.007743,0.051549,-0.010529,-0.00351
score,-0.016425,1.0,0.876223,-0.032451,0.027104,-0.010797,-0.012567,0.005614,0.003988,-0.03868,0.021676,-0.001706,0.013505,0.032407,-0.002235,0.009947,0.002971,0.014652,0.014447,-0.009132,-0.011158,7.8e-05,-0.004407,-0.017304,0.026803,0.005308,0.006753,0.017486,0.006245,-0.002102,-0.002549,-0.009032,-0.001775,-0.000132,0.007814,-0.003919,-0.008469,0.011246,0.028911,0.036591,0.041405,0.033456,0.008004,-0.016366,0.007569,-0.003311,0.002236,0.022405,-0.004949,0.02451,-0.003964,0.001153,0.00778,0.030928,-0.004984,0.07482,0.012953,0.018132,0.021449,-0.006083,-0.00237,0.036939,-0.015564,0.016177,-0.020822,-0.018419,-0.01201,0.009603,-0.003766,-0.00088,0.02806,0.012775,0.0308,0.007146,-0.035704,0.006132,-0.005029,-0.005792,0.008325,-0.013724,0.018338,0.001327,-0.009374,-0.001466,-0.029915,0.007142,-0.005418,0.003774,0.027766,-0.008247,0.011746,0.003057,-0.002975,-0.008819,0.003114,-0.001969,0.007639,-0.019905,0.003561,0.003236,0.007598,-0.024089,-0.028664,-0.002879,-0.007716
ups,0.00055,0.876223,1.0,0.29943,-0.078391,-0.010935,-0.00022,0.005156,0.005539,-0.033775,0.013486,0.002815,0.006615,0.024021,-0.006512,0.013121,0.00482,0.014438,0.006904,-0.010317,-0.009244,0.001592,-0.007027,-0.010763,0.028279,0.009208,0.002359,0.024041,0.008276,-0.001817,-0.014804,-0.007441,0.001106,-0.009982,0.017065,-0.009182,-0.012805,0.010783,0.034139,-0.010599,0.039571,0.038824,0.00674,-0.006541,0.007389,-0.005639,0.010623,0.022865,-0.006034,0.021664,-0.004027,-1.2e-05,0.007926,0.030314,-0.008274,0.083259,0.019329,0.015711,0.015432,-0.002677,-0.011467,0.03807,-0.019475,0.009296,-0.009889,-0.014582,-0.012233,0.011485,-0.008482,-0.003752,0.011644,0.011727,0.030213,0.009143,-0.029375,0.008849,-0.007214,-0.001837,0.01093,-0.008208,0.014385,0.002453,-0.012826,0.003237,-0.020971,-0.003558,-0.006749,-0.027223,0.026221,-0.007623,0.015274,0.007553,0.000423,-0.004025,0.002988,-0.003809,0.008362,-0.012861,0.004565,1.5e-05,0.009591,-0.016623,-0.018466,-0.00705,-0.006182
downs,0.054518,-0.032451,0.29943,1.0,-0.31926,-0.002521,0.035194,0.001562,0.004857,-0.003752,-0.011074,0.012418,-0.009839,-0.008368,-0.017125,0.010372,0.005428,0.003951,-0.015005,-0.009238,0.003604,0.002725,-0.016015,0.011928,0.010406,0.013015,-0.009351,0.020367,0.00964,0.00186,-0.045014,-0.001423,0.007773,-0.031239,0.027016,-0.01582,-0.01693,0.001266,0.017987,-0.111211,0.005258,0.018822,-0.000223,0.027341,0.000465,-0.011074,0.025725,0.007342,-0.003691,-0.000608,-0.002191,-0.006637,0.003181,0.005893,-0.00969,0.026853,0.018168,-0.001547,-0.008819,0.008879,-0.031882,0.01081,-0.021435,-0.011457,0.028694,0.008358,-0.005923,0.007406,-0.012563,-0.008735,-0.031498,0.003503,0.00861,0.008497,0.010803,0.008462,-0.012442,0.007891,0.011871,0.012986,-0.009719,0.00159,-0.016396,0.009161,0.017546,-0.032238,-0.008662,-0.077939,0.004208,-0.000842,0.013557,0.01462,0.012682,0.011965,0.003782,-0.006789,0.003746,0.019093,0.003188,-0.009676,0.004512,0.017258,0.023603,-0.014093,0.001746
year,-0.049559,0.027104,-0.078391,-0.31926,1.0,0.029918,-0.053869,-0.00837,0.000901,-0.007533,0.021282,-0.014373,0.026743,-0.00975,0.044058,-0.012317,0.010666,-0.000865,0.034383,0.029749,0.006535,0.007317,0.01713,0.05465,0.016368,-0.060016,0.027032,-0.060316,0.005573,0.037223,0.064241,0.007443,-0.004461,0.038848,0.044415,0.036896,0.026957,0.023211,-0.024569,0.102909,0.004663,0.007644,-0.010662,-0.083594,0.007261,-0.000276,-0.090163,-0.00638,-0.002729,0.007667,0.031233,-0.001534,0.00594,0.007242,-0.012847,-0.037865,-0.033185,0.023434,0.025308,-0.001166,0.007529,0.010323,0.050682,0.016625,-0.07855,-0.059565,0.022085,-0.046876,0.030391,0.027565,0.044624,0.004063,0.01454,-0.001658,0.018673,0.007522,0.016139,0.011864,0.015115,0.01515,0.007346,0.012541,0.019902,0.048733,-0.087496,0.007406,0.051148,-0.005213,0.01036,0.014239,0.00449,-0.006382,-0.047888,-0.045108,-0.018883,0.019143,0.005318,-0.02569,-0.018313,0.016465,0.005368,-0.02388,-0.032403,0.023936,0.01177
SUB_2007scape,-0.001008,-0.010797,-0.010935,-0.002521,0.029918,1.0,-0.011355,-0.005752,-0.004218,-0.026646,-0.003945,-0.004824,-0.003877,-0.0081,-0.004415,-0.004349,-0.004178,-0.004231,-0.003999,-0.007035,-0.007591,-0.003771,-0.003783,-0.011554,-0.005408,-0.005009,-0.005792,-0.005602,-0.003846,-0.003779,-0.00623,-0.004462,-0.003844,-0.003854,-0.005107,-0.00611,-0.005811,-0.006717,-0.004545,-0.009517,-0.004077,-0.008315,-0.00381,-0.009535,-0.003952,-0.004107,-0.008488,-0.006542,-0.004819,-0.005723,-0.004124,-0.006199,-0.004113,-0.004279,-0.005599,-0.006843,-0.004533,-0.006371,-0.004955,-0.004034,-0.004697,-0.004501,-0.005907,-0.004893,-0.013201,-0.010704,-0.007072,-0.004485,-0.0064,-0.004139,-0.004125,-0.004771,-0.009022,-0.007195,-0.014407,-0.004662,-0.004792,-0.007859,-0.011473,-0.012782,-0.011455,-0.004836,-0.003938,-0.01361,-0.012537,-0.004836,-0.004729,-0.020083,-0.0041,-0.004333,-0.005145,-0.008702,-0.004069,-0.007358,-0.004732,-0.003925,-0.004846,-0.011694,-0.005174,-0.003787,-0.004874,-0.010877,-0.01618,-0.005892,-0.005008
SUB_AdviceAnimals,0.011097,-0.012567,-0.00022,0.035194,-0.053869,-0.011355,1.0,-0.011788,-0.008644,-0.054609,-0.008084,-0.009886,-0.007946,-0.0166,-0.009049,-0.008913,-0.008562,-0.008671,-0.008197,-0.014417,-0.015557,-0.007729,-0.007754,-0.023679,-0.011084,-0.010266,-0.011871,-0.011481,-0.007882,-0.007746,-0.012767,-0.009145,-0.007879,-0.007898,-0.010466,-0.012523,-0.011909,-0.013766,-0.009316,-0.019504,-0.008356,-0.01704,-0.007808,-0.01954,-0.0081,-0.008417,-0.017396,-0.013407,-0.009876,-0.01173,-0.008452,-0.012705,-0.00843,-0.008769,-0.011475,-0.014023,-0.00929,-0.013057,-0.010155,-0.008266,-0.009626,-0.009224,-0.012107,-0.010028,-0.027055,-0.021937,-0.014493,-0.009191,-0.013116,-0.008482,-0.008455,-0.009778,-0.01849,-0.014745,-0.029526,-0.009555,-0.009822,-0.016107,-0.023514,-0.026196,-0.023475,-0.00991,-0.008071,-0.027892,-0.025694,-0.00991,-0.009693,-0.041159,-0.008402,-0.00888,-0.010545,-0.017835,-0.008338,-0.015079,-0.009697,-0.008045,-0.009932,-0.023967,-0.010603,-0.007762,-0.00999,-0.022291,-0.03316,-0.012075,-0.010264
SUB_Android,-0.003764,0.005614,0.005156,0.001562,-0.00837,-0.005752,-0.011788,1.0,-0.004379,-0.027661,-0.004095,-0.005008,-0.004025,-0.008408,-0.004583,-0.004515,-0.004337,-0.004392,-0.004152,-0.007303,-0.00788,-0.003915,-0.003928,-0.011995,-0.005615,-0.0052,-0.006013,-0.005815,-0.003992,-0.003923,-0.006467,-0.004632,-0.003991,-0.004001,-0.005301,-0.006343,-0.006032,-0.006973,-0.004719,-0.009879,-0.004233,-0.008632,-0.003955,-0.009898,-0.004103,-0.004264,-0.008812,-0.006791,-0.005002,-0.005942,-0.004281,-0.006436,-0.00427,-0.004442,-0.005813,-0.007103,-0.004706,-0.006614,-0.005144,-0.004187,-0.004876,-0.004672,-0.006133,-0.00508,-0.013704,-0.011112,-0.007341,-0.004656,-0.006644,-0.004297,-0.004283,-0.004953,-0.009366,-0.007469,-0.014956,-0.00484,-0.004975,-0.008159,-0.011911,-0.013269,-0.011891,-0.00502,-0.004088,-0.014129,-0.013015,-0.00502,-0.00491,-0.020849,-0.004256,-0.004498,-0.005341,-0.009034,-0.004224,-0.007638,-0.004912,-0.004075,-0.005031,-0.01214,-0.005371,-0.003932,-0.00506,-0.011291,-0.016797,-0.006117,-0.005199
SUB_AskMen,-0.009118,0.003988,0.005539,0.004857,0.000901,-0.004218,-0.008644,-0.004379,1.0,-0.020284,-0.003003,-0.003672,-0.002952,-0.006166,-0.003361,-0.003311,-0.00318,-0.003221,-0.003045,-0.005355,-0.005779,-0.002871,-0.00288,-0.008796,-0.004117,-0.003813,-0.00441,-0.004264,-0.002928,-0.002877,-0.004742,-0.003397,-0.002927,-0.002934,-0.003887,-0.004652,-0.004424,-0.005113,-0.00346,-0.007245,-0.003104,-0.00633,-0.0029,-0.007258,-0.003009,-0.003126,-0.006462,-0.00498,-0.003668,-0.004357,-0.00314,-0.004719,-0.003131,-0.003257,-0.004262,-0.005209,-0.003451,-0.00485,-0.003772,-0.003071,-0.003576,-0.003426,-0.004497,-0.003725,-0.010049,-0.008148,-0.005383,-0.003414,-0.004872,-0.003151,-0.003141,-0.003632,-0.006868,-0.005477,-0.010967,-0.003549,-0.003648,-0.005983,-0.008734,-0.00973,-0.00872,-0.003681,-0.002998,-0.010361,-0.009544,-0.003681,-0.0036,-0.015288,-0.003121,-0.003298,-0.003917,-0.006625,-0.003097,-0.005601,-0.003602,-0.002988,-0.003689,-0.008902,-0.003939,-0.002883,-0.003711,-0.00828,-0.012317,-0.004485,-0.003812
SUB_AskReddit,-0.086663,-0.03868,-0.033775,-0.003752,-0.007533,-0.026646,-0.054609,-0.027661,-0.020284,1.0,-0.01897,-0.0232,-0.018646,-0.038953,-0.021234,-0.020916,-0.020093,-0.020348,-0.019234,-0.033831,-0.036506,-0.018137,-0.018195,-0.055566,-0.026011,-0.024089,-0.027857,-0.02694,-0.018495,-0.018176,-0.02996,-0.021459,-0.018489,-0.018533,-0.024558,-0.029387,-0.027946,-0.032304,-0.02186,-0.045767,-0.019608,-0.039987,-0.018323,-0.045854,-0.019007,-0.019751,-0.040821,-0.031461,-0.023174,-0.027525,-0.019834,-0.029813,-0.019781,-0.020577,-0.026927,-0.032907,-0.021801,-0.030639,-0.023829,-0.019398,-0.022589,-0.021644,-0.02841,-0.023532,-0.063487,-0.051477,-0.034009,-0.021568,-0.030777,-0.019905,-0.01984,-0.022945,-0.04339,-0.034601,-0.069285,-0.022422,-0.023047,-0.037797,-0.055178,-0.061471,-0.055087,-0.023255,-0.018939,-0.065452,-0.060294,-0.023255,-0.022744,-0.096583,-0.019716,-0.020838,-0.024744,-0.041851,-0.019566,-0.035385,-0.022755,-0.018878,-0.023306,-0.05624,-0.024882,-0.018214,-0.023441,-0.052308,-0.077813,-0.028335,-0.024085


Looking at the correlations with label, we can see that all of them are very small. The strongest is SUB_AskReddit at about -0.09. The next largest, such as downs (0.055) and SUB_worldnews (0.052), are only around 0.05 in absolute value. Overall, these are very low.
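Scanning a wide correlation table by eye is error-prone. A minimal sketch of ranking the correlations with label by absolute value, using a hypothetical toy frame (the columns `a`, `b`, `c` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for df_2.
toy = pd.DataFrame({
    "label": [0, 0, 1, 1, 0, 1],
    "a": [1, 2, 3, 4, 2, 5],
    "b": [5, 3, 4, 1, 6, 2],
    "c": [1, 0, 1, 0, 1, 0],
})

# Correlation of every other column with label.
corr_with_label = toy.corr()["label"].drop("label")

# Sort by magnitude so strong negative correlations are not overlooked.
ranked = corr_with_label.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

On our data the same idea would be applied to `df_2.corr()['label']`.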

Let's turn our dummy columns into floats so our model can work with them.

In [30]:
df_2 = df_2.astype(np.float64)

Now let's define X as all the columns to the right of label, and y as label. Then let's add the constant.

In [31]:
X = df_2[df_2.columns[1:]]
y = df_2['label']

X_withconstant = sm.add_constant(X)

X_withconstant.head()

Unnamed: 0,const,score,ups,downs,year,SUB_2007scape,SUB_AdviceAnimals,SUB_Android,SUB_AskMen,SUB_AskReddit,...,SUB_television,SUB_tf2,SUB_todayilearned,SUB_trees,SUB_ukpolitics,SUB_unitedkingdom,SUB_videos,SUB_worldnews,SUB_wow,SUB_xboxone
0,1.0,2.0,-1.0,-1.0,2016.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,-4.0,-1.0,-1.0,2016.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,3.0,3.0,0.0,2016.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,-8.0,-1.0,-1.0,2016.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,2016.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next let's create our logistic model and fit the model.

In [32]:
mylogreg= sm.Logit(y,X_withconstant)
mylogreg_results = mylogreg.fit()
mylogreg_results.summary()

Optimization terminated successfully.
         Current function value: 0.678544
         Iterations 5


Dep. Variable:,label,No. Observations:,549904.0
Model:,Logit,Df Residuals:,549799.0
Method:,MLE,Df Model:,104.0
Date:,"Tue, 02 Jan 2024",Pseudo R-squ.:,0.01926
Time:,22:09:28,Log-Likelihood:,-373130.0
converged:,True,LL-Null:,-380460.0
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
const,113.1000,,,,,
score,-0.0012,0.002,-0.651,0.515,-0.005,0.002
ups,-0.0124,0.002,-6.292,0.000,-0.016,-0.009
downs,0.2853,0.011,25.194,0.000,0.263,0.307
year,-0.0566,0.002,-23.489,0.000,-0.061,-0.052
SUB_2007scape,1.1100,,,,,
SUB_AdviceAnimals,1.2114,,,,,
SUB_Android,1.0090,,,,,
SUB_AskMen,0.7872,,,,,


At first glance we see many null (NaN) values: the constant and the subreddit columns have no standard errors, z-scores, or p-values. Furthermore, the pseudo R-squared is very low (0.019). On the other hand, ups, downs, and year have low p-values, which indicates statistical significance.
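Missing standard errors usually mean the covariance matrix could not be estimated reliably. One possible explanation, which this journal does not verify, is that the subreddit dummies together with the constant are linearly dependent or nearly so (the "dummy variable trap"). A minimal sketch of that effect on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column with three levels.
toy = pd.DataFrame({"subreddit": ["a", "b", "c", "a", "b", "c"]})

# Full set of dummies plus an intercept: the dummy columns sum to the intercept.
full = pd.get_dummies(toy["subreddit"], prefix="SUB").astype(float)
full.insert(0, "const", 1.0)

# Dropping one level removes the exact linear dependence.
reduced = pd.get_dummies(toy["subreddit"], prefix="SUB", drop_first=True).astype(float)
reduced.insert(0, "const", 1.0)

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print("full:", full.shape[1], "columns, rank", rank_full)
print("reduced:", reduced.shape[1], "columns, rank", rank_reduced)
```

If the rank is lower than the number of columns, the design matrix is collinear. Whether that is what happened in our model would need to be checked on `X_withconstant` itself.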

Let's see how well this model predicts. We will treat any predicted probability over 0.5 as 1 (sarcastic) and everything else as 0 (not sarcastic).

In [33]:
model_predictions_prob = mylogreg_results.predict(X_withconstant)
model_predictions_binary = np.where(model_predictions_prob>0.5,1,0)

In [34]:
(model_predictions_binary == df_2['label']).sum()

313634

In [35]:
len(df_2['label'])

549904

In [36]:
print("The accuracy is:", (313634/549904)*100 , "%")

The accuracy is: 57.03431871744886 %
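The accuracy above was typed in from the two previous outputs, and it is measured on the same rows the model was fit on. As a sketch with hypothetical labels and probabilities (not our data), the same figure can be computed directly and compared against a majority-class baseline:

```python
import numpy as np

# Hypothetical true labels and predicted probabilities, for illustration only.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
probs = np.array([0.2, 0.7, 0.4, 0.6, 0.9, 0.1, 0.55, 0.8])

# Same 0.5 threshold as in the cells above.
y_pred = np.where(probs > 0.5, 1, 0)

# The mean of a boolean array is the fraction of correct predictions.
accuracy = (y_pred == y_true).mean()

# Accuracy of always guessing the more common class.
baseline = max(y_true.mean(), 1 - y_true.mean())

print("accuracy:", accuracy)
print("baseline:", baseline)
```

A model is only useful to the extent that it beats this baseline, so it is worth computing both on the real labels.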


Our initial model is pretty weak, with about 57% accuracy. Let's see if we can improve on this. For the next model, let's try working only with columns whose correlation with label was at least 0.015 in absolute value.

In [38]:
df_3 = df_2[['label', 'ups', 'downs', 'year', 'score', 'SUB_AskReddit', 'SUB_atheism', 'SUB_aww', 
            'SUB_creepyPMs', 'SUB_fantasyfootball', 'SUB_news', 'SUB_worldnews',
            'SUB_MensRights']]
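The column list above was typed by hand. As a sketch, a selection like this could also be derived from the correlations, which keeps the list consistent with the chosen cutoff. The toy frame and the cutoff of 0.5 below are hypothetical and chosen only so the small example selects something:

```python
import pandas as pd

# Hypothetical toy data standing in for df_2.
toy = pd.DataFrame({
    "label": [0, 0, 1, 1, 0, 1],
    "a": [1, 2, 3, 4, 2, 5],
    "b": [5, 3, 4, 1, 6, 2],
    "c": [1, 0, 1, 0, 1, 0],
})

threshold = 0.5  # hypothetical cutoff for the toy data; the journal uses 0.015

# Keep columns whose absolute correlation with label meets the cutoff.
corr = toy.corr()["label"].drop("label")
selected = corr[corr.abs() >= threshold].index.tolist()

# label goes first so that columns[1:] still picks out the predictors.
toy_subset = toy[["label"] + selected]
print(selected)
```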

In [39]:
df_3.head()

Unnamed: 0,label,ups,downs,year,score,SUB_AskReddit,SUB_atheism,SUB_aww,SUB_creepyPMs,SUB_fantasyfootball,SUB_news,SUB_worldnews,SUB_MensRights
0,0.0,-1.0,-1.0,2016.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,-1.0,-1.0,2016.0,-4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,3.0,0.0,2016.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,-1.0,-1.0,2016.0,-8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,2016.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now let's repeat the process: define X and y, add the constant, and fit the model.

In [40]:
X = df_3[df_3.columns[1:]]
y = df_3['label']

X_withconstant_2 = sm.add_constant(X)

In [41]:
mylogreg_2= sm.Logit(y,X_withconstant_2)
mylogreg_2_results = mylogreg_2.fit()
mylogreg_2_results.summary()

Optimization terminated successfully.
         Current function value: 0.682660
         Iterations 5


Dep. Variable:,label,No. Observations:,549904.0
Model:,Logit,Df Residuals:,549891.0
Method:,MLE,Df Model:,12.0
Date:,"Tue, 02 Jan 2024",Pseudo R-squ.:,0.01331
Time:,22:12:23,Log-Likelihood:,-375400.0
converged:,True,LL-Null:,-380460.0
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
const,101.3647,4.618,21.950,0.000,92.314,110.416
ups,-0.0122,0.002,-6.220,0.000,-0.016,-0.008
downs,0.2791,0.011,24.909,0.000,0.257,0.301
year,-0.0502,0.002,-21.909,0.000,-0.055,-0.046
score,0.0006,0.002,0.305,0.761,-0.003,0.004
SUB_AskReddit,-0.5150,0.009,-58.956,0.000,-0.532,-0.498
SUB_atheism,0.3789,0.025,15.099,0.000,0.330,0.428
SUB_aww,-0.8627,0.045,-19.256,0.000,-0.951,-0.775
SUB_creepyPMs,1.1423,0.036,31.814,0.000,1.072,1.213


These results are interesting. The pseudo R-squared is still small (0.013), but now every column except score has a low p-value, indicating statistical significance. There are also no null values. Now let's check the accuracy.
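Logistic regression coefficients are on the log-odds scale, so they are easier to read after exponentiating. The sketch below converts a few of the coefficients printed above into odds ratios. Each ratio is the multiplicative change in the odds of a sarcastic label for a one-unit increase in that column, holding the other columns fixed:

```python
import math

# Coefficients copied from the summary table above.
coefs = {
    "downs": 0.2791,
    "SUB_AskReddit": -0.5150,
    "SUB_atheism": 0.3789,
    "SUB_aww": -0.8627,
    "SUB_creepyPMs": 1.1423,
}

# exp(coefficient) turns a log-odds change into an odds ratio.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio below 1 (for example SUB_AskReddit) means lower odds of sarcasm, and a ratio above 1 (for example SUB_creepyPMs) means higher odds.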


In [42]:
model_predictions_prob_2 = mylogreg_2_results.predict(X_withconstant_2)
model_predictions_binary_2 = np.where(model_predictions_prob_2>0.5,1,0)

In [43]:
(model_predictions_binary_2 == df_3['label']).sum()

305322

In [44]:
len(df_3['label'])

549904

In [45]:
print("The accuracy is:", (305322/549904)*100 , "%")

The accuracy is: 55.52278215834037 %


Hmm. Unfortunately, our model is even worse. Back to the drawing board.
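The drop in fit can also be read off the two summaries. McFadden's pseudo R-squared is 1 - LL/LL_null. The sketch below recomputes it from the log-likelihoods as printed above, which are rounded, so the results match the reported 0.01926 and 0.01331 only approximately:

```python
# Log-likelihoods copied from the two summary tables (rounded as printed).
ll_null = -380460.0
ll_full = -373130.0      # first model, all columns
ll_reduced = -375400.0   # second model, selected columns


def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null


r2_full = mcfadden_r2(ll_full, ll_null)
r2_reduced = mcfadden_r2(ll_reduced, ll_null)
print(round(r2_full, 4), round(r2_reduced, 4))
```

Both values are close to zero, which is consistent with accuracies that sit only a little above chance.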