# Statistical Model

Define Variables and Hypotheses:  
- Formulate your hypotheses regarding the relationships between these variables. For example, you might hypothesize that higher media bias scores on certain topics are associated with specific election outcomes.   
The response is the normalized election vote count by year, state, and party. The features are the sentiment scores by year and bias topic. For the Twitter data, the sentiment scores are by year, state, and bias topic.  
Hypothesis: Higher negative sentiment scores on a bias topic are associated with higher vote counts for Republican candidates over time.
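
One way to make this hypothesis testable is to write it as a simple linear model. The specification below is only a sketch: the symbols are illustrative and not taken from the analysis, and a linear form is an assumption.

```latex
y_{s,t} = \beta_0 + \beta_1 \,\mathrm{Neg}_{s,t} + \beta_2 \,\mathrm{Pos}_{s,t} + \varepsilon_{s,t}
```

Here $y_{s,t}$ stands for the normalized Republican vote count in state $s$ and year $t$, $\mathrm{Neg}_{s,t}$ and $\mathrm{Pos}_{s,t}$ are the negative and positive sentiment scores for a given bias topic, and $\varepsilon_{s,t}$ is the error term. Under this sketch, the hypothesis corresponds to $\beta_1 > 0$.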

Feature Engineering:  
- Create relevant features based on your hypotheses and domain knowledge. For example, you could calculate the average sentiment scores for each topic in both the media and Twitter data. You could also consider time-based features, such as the sentiment change over time or the frequency of media coverage on specific topics.
The average sentiment scores for each topic over time still need to be computed.
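
The averaging step described above could be sketched with a pandas `groupby`. The data here is a toy stand-in, and `Topic` is a hypothetical column name; `Year`, `State`, `Norm_Neg`, and `Norm_Pos` follow the names used in the regression cells.

```python
import pandas as pd

# Toy data; "Topic" is a hypothetical column name, the values are made up
tweets = pd.DataFrame({
    "Year":  [2016, 2016, 2016, 2020, 2020, 2020],
    "State": ["OH", "OH", "TX", "OH", "TX", "TX"],
    "Topic": ["immigration", "immigration", "economy",
              "immigration", "economy", "economy"],
    "Norm_Neg": [0.2, 0.4, 0.1, 0.6, 0.3, 0.5],
    "Norm_Pos": [0.5, 0.3, 0.6, 0.1, 0.4, 0.2],
})

# Average sentiment per topic and year
avg_sentiment = tweets.groupby(["Year", "Topic"], as_index=False)[
    ["Norm_Neg", "Norm_Pos"]
].mean()

# Time-based feature: year-over-year change in negative sentiment per topic
avg_sentiment = avg_sentiment.sort_values(["Topic", "Year"])
avg_sentiment["Neg_Change"] = avg_sentiment.groupby("Topic")["Norm_Neg"].diff()

print(avg_sentiment)
```

Adding `State` to the grouping keys would give the state-level averages used for the Twitter data.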

Statistical Analysis and Model Selection:  
- Conduct exploratory data analysis to gain insights into the relationships between variables. Visualize the data using plots and conduct statistical tests to identify correlations and patterns.  
- Choose appropriate statistical models that can capture the relationships between media bias, sentiment scores, election outcomes, and voter turnout. Some potential models to consider include regression models, time series analysis, or structural equation modeling.  
- Assess the assumptions of your chosen models and validate them using appropriate techniques, such as cross-validation or bootstrapping.  
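
As an illustration of the bootstrapping mentioned above, the following sketch resamples observations with replacement and refits a least-squares line to get a percentile confidence interval for the slope. The data is synthetic and the variable names are placeholders, not the project's actual columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: x plays the role of a sentiment score,
# y the role of a vote share
n = 200
x = rng.uniform(0, 1, size=n)
y = 0.5 + 0.1 * x + rng.normal(0, 0.05, size=n)


def ols_slope(x, y):
    """Return the slope of a least-squares line fit of y on x."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


# Resample (x, y) pairs with replacement and refit each time
n_boot = 1000
slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    slopes[b] = ols_slope(x[idx], y[idx])

# Percentile bootstrap interval for the slope
lower, upper = np.percentile(slopes, [2.5, 97.5])
point = ols_slope(x, y)
print(f"slope = {point:.3f}, 95% bootstrap CI = [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would be consistent with a non-zero association, although this simple pairs bootstrap treats rows as independent, which may not hold for repeated state-year observations.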

Model Training and Evaluation:  
- Split your dataset into training and testing sets. Train your statistical models on the training data and evaluate their performance on the testing data.
- Use appropriate evaluation metrics to assess the predictive power and goodness-of-fit of your models. Adjust and refine your models as needed.  
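
A minimal sketch of the split-and-evaluate step, using synthetic data and plain NumPy least squares rather than the project's tables. The 80/20 split and the choice of RMSE and out-of-sample R-squared as metrics are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: two sentiment features and a vote-share response
n = 500
X = rng.uniform(0, 1, size=(n, 2))
y = 0.45 + 0.08 * X[:, 0] - 0.03 * X[:, 1] + rng.normal(0, 0.05, size=n)

# Shuffle the row indices, then hold out 20% of the rows for testing
perm = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = perm[:n_test], perm[n_test:]


def add_intercept(features):
    """Prepend a column of ones so the model has an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Fit OLS on the training rows only
coef, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

# Evaluate on the held-out rows
pred = add_intercept(X[test_idx]) @ coef
resid = y[test_idx] - pred
rmse = np.sqrt(np.mean(resid ** 2))
ss_total = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2_out = 1 - np.sum(resid ** 2) / ss_total

print(f"test RMSE = {rmse:.4f}, out-of-sample R^2 = {r2_out:.3f}")
```

Because the data here is ordered by election year, holding out the most recent years instead of a random sample may be a more realistic test.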

Interpretation and Reporting:  
- Interpret the results of your statistical models, focusing on the relationships between media bias, sentiment scores, and election outcomes.
- Consider the limitations and potential biases of your study and discuss them in your report.
- Clearly communicate your findings, including any significant relationships or insights, in a comprehensive report or presentation.

In [2]:
import pandas as pd
import statsmodels.api as sm

# Load the Twitter data from the "twitter" table in the "sentimentdb" database
twitter_data = pd.read_sql_query("SELECT * FROM twitter;", "postgresql://postgres:YourPassword@localhost:5432/sentimentdb")

# Load the election results data from the "results" table in the "electiondb" database
election_results = pd.read_sql_query("SELECT * FROM results;", "postgresql://postgres:YourPassword@localhost:5432/electiondb")

# Load the voter turnout data from the "voters" table in the "electiondb" database
voter_turnout = pd.read_sql_query("SELECT * FROM voters;", "postgresql://postgres:YourPassword@localhost:5432/electiondb")

# Merge the data frames based on common columns (Year and State)
merged_data = pd.merge(twitter_data, election_results, on=["Year", "State"])
merged_data = pd.merge(merged_data, voter_turnout, on=["Year", "State"])

# Perform regression analysis
X = merged_data[["Norm_Neg", "Norm_Pos"]]  # Independent variables (normalized negative and positive Twitter sentiment scores)
X = sm.add_constant(X)  # Add a constant term for the intercept
y = merged_data["Vote %"]  # Dependent variable (Election vote percentage)

model = sm.OLS(y, X)  # Ordinary Least Squares (OLS) regression
results = model.fit()  # Fit the model

# Print the regression results summary
print(results.summary())


                            OLS Regression Results                            
Dep. Variable:                 Vote %   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     5.968
Date:                Sat, 13 May 2023   Prob (F-statistic):            0.00256
Time:                        21:22:45   Log-Likelihood:             1.3243e+05
No. Observations:              690492   AIC:                        -2.649e+05
Df Residuals:                  690489   BIC:                        -2.648e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4912      0.001    533.510      0.0

In [3]:
import pandas as pd
import statsmodels.api as sm

# Load the Twitter data from the "twitter" table in the "sentimentdb" database
twitter_data = pd.read_sql_query("SELECT * FROM twitter;", "postgresql://postgres:YourPassword@localhost:5432/sentimentdb")

# Load the election results data from the "results" table in the "electiondb" database
election_results = pd.read_sql_query("SELECT * FROM results;", "postgresql://postgres:YourPassword@localhost:5432/electiondb")

# Load the voter turnout data from the "voters" table in the "electiondb" database
voter_turnout = pd.read_sql_query("SELECT * FROM voters;", "postgresql://postgres:YourPassword@localhost:5432/electiondb")

# Merge the data frames based on common columns (Year and State)
merged_data = pd.merge(twitter_data, election_results, on=["Year", "State"])
merged_data = pd.merge(merged_data, voter_turnout, on=["Year", "State"])

# Calculate the normalized vote count
merged_data["Norm_Vote_Count"] = merged_data["Vote count"] / merged_data["Registered"]

# Perform regression analysis
X = merged_data[["Norm_Neg", "Norm_Pos"]]  # Independent variables (normalized negative and positive Twitter sentiment scores)
X = sm.add_constant(X)  # Add a constant term for the intercept
y = merged_data["Norm_Vote_Count"]  # Dependent variable (Normalized vote count)

model = sm.OLS(y, X)  # Ordinary Least Squares (OLS) regression
results = model.fit()  # Fit the model

# Print the regression results summary
print(results.summary())


                            OLS Regression Results                            
Dep. Variable:        Norm_Vote_Count   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     17.78
Date:                Sat, 13 May 2023   Prob (F-statistic):           1.90e-08
Time:                        21:26:44   Log-Likelihood:            -4.1107e+06
No. Observations:              690492   AIC:                         8.221e+06
Df Residuals:                  690489   BIC:                         8.221e+06
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         39.9685      0.429     93.067      0.0
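
Both regressions report 690,492 observations, which is far more than the number of state-year-party combinations one would expect. This may indicate that merging on `Year` and `State` alone repeats each election row once per Twitter row. The regressions also pool all parties, while the hypothesis concerns Republican candidates. The sketch below shows one possible way to address both points, under stated assumptions: the data is a toy stand-in, `Party` is a hypothetical column name, and averaging sentiment per state-year is one choice among several.

```python
import pandas as pd

# Toy stand-ins for the three tables; "Party" is a hypothetical column name
# and all values are made up
twitter_data = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "State": ["OH", "OH", "TX", "TX"],
    "Norm_Neg": [0.2, 0.4, 0.5, 0.7],
    "Norm_Pos": [0.6, 0.4, 0.3, 0.1],
})
election_results = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "State": ["OH", "OH", "TX", "TX"],
    "Party": ["Republican", "Democrat", "Republican", "Democrat"],
    "Vote count": [530, 450, 520, 460],
})
voter_turnout = pd.DataFrame({
    "Year": [2020, 2020],
    "State": ["OH", "TX"],
    "Registered": [1000, 1000],
})

keys = ["Year", "State"]

# Collapse the Twitter rows to one row per (Year, State)
sentiment = twitter_data.groupby(keys, as_index=False)[
    ["Norm_Neg", "Norm_Pos"]
].mean()

# Keep only the party named in the hypothesis
republican = election_results[election_results["Party"] == "Republican"]

# validate= raises a MergeError if the keys are not unique on both sides
merged = sentiment.merge(republican, on=keys, validate="one_to_one")
merged = merged.merge(voter_turnout, on=keys, validate="one_to_one")

merged["Norm_Vote_Count"] = merged["Vote count"] / merged["Registered"]
print(merged)
```

With one row per state and year, the observation count in the OLS summary reflects the number of elections rather than the number of tweets.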