<a href="https://colab.research.google.com/github/SHodapp117/Applied-Machine-Learning/blob/main/Shodapp_baselinev1_LRmodel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title Setup
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

# Authenticate first so the BigQuery client is created with the user's credentials.
auth.authenticate_user()

project = 'dapperlabs-data' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()

## Reference SQL syntax from the original job
Use the ```jobs.query```
[method](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) to
return the SQL syntax from the job. This can be copied from the output cell
below to edit the query now or in the future. Alternatively, you can use
[this link](https://console.cloud.google.com/bigquery?j=dapperlabs-data:US:bquxjob_16ea2d80_18b91392c05)
back to BigQuery to edit the query within the BigQuery user interface.

In [None]:
# Running this code will display the query used to generate your previous job

job = client.get_job('bquxjob_16ea2d80_18b91392c05') # Job ID inserted based on the query results selected to explore
print(job.query)

WITH listing_history as (
  select *
  from `dapperlabs-data.berkeley_ds_sandbox.berkeley_ds_source_nfl_historical_listings_time_series`
),
-- get unique rows for every week in last 6m per moment
weekly_series as (
    select distinct a.flow_moment_id, a.moment_flow_edition_id, date_trunc(date, week) as week
    from listing_history as a,
        unnest(generate_date_array(date_trunc(date_sub(date_trunc(current_date, month), interval 6 month), week), date_sub(date_trunc(current_date(), week), interval 3 week), interval 1 day)) as date
),
--- weekly avg of sold listings
sold_avg AS (
  SELECT w.week, w.flow_moment_id, w.moment_flow_edition_id, AVG(l.listing_price_usd) AS avg_sold
  FROM weekly_series as w
  LEFT JOIN (select * from listing_history where listing_status = 'SOLD')  as l
    on w.flow_moment_id = l.flow_moment_id
      and date_trunc(l.event_timestamp, week) = w.week
  GROUP BY w.week, w.flow_moment_id, w.moment_flow_edition_id
),
-- weekly avg of non sol

# Result set loaded from BigQuery job as a DataFrame
Query results are retrieved by the ID of the job that ran in BigQuery, so the
query does not need to be re-run to explore the results. The ```to_dataframe```
[method](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
downloads the results to a Pandas DataFrame by using the BigQuery Storage API.

To edit the query syntax, use the BigQuery SQL editor or the
```Optional:``` sections below.

In [None]:
# Running this code will read results from your previous job

job = client.get_job('bquxjob_16ea2d80_18b91392c05') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results.head()

Unnamed: 0,week,flow_moment_id,moment_flow_edition_id,avg_sold,avg_listed,edition_floor_listed,moment_value_current_week,target_moment_value_next_week,moment_play_player_position,position_QB,position_RB,position_WR,position_TE,position_LB,position_DL,position_DB,rarity,final_player_score,serial_to_mint_ratio,listed_supply
0,2023-04-30,1217355,573,2.0,3.000000,1.0,2.0,1.0,WR,0,0,1,0,0,0,0,1.00000,0.058292,0.490000,86
1,2023-04-30,1309920,606,92.0,211.000000,80.0,92.0,92.0,QB,1,0,0,0,0,0,0,0.01250,0.563732,0.231368,27
2,2023-04-30,1562186,637,32.0,163.000000,32.0,32.0,35.0,TE,0,0,0,1,0,0,0,0.03125,0.039645,0.233528,2
3,2023-04-30,1570346,644,329.0,339.000000,329.0,329.0,329.0,WR,0,0,1,0,0,0,0,0.00304,0.040034,0.796610,8
4,2023-04-30,1663283,681,12.0,12.833333,5.0,12.0,12.0,WR,0,0,1,0,0,0,0,0.20000,0.086319,0.065100,141
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5461,2023-10-08,968842,532,1.0,2.000000,1.0,1.0,,QB,1,0,0,0,0,0,0,1.00000,0.570914,0.355000,4
5462,2023-10-08,974470,533,2.0,3.000000,2.0,2.0,,QB,1,0,0,0,0,0,0,0.50000,0.487371,0.058500,3
5463,2023-10-08,974888,533,2.0,3.000000,2.0,2.0,,QB,1,0,0,0,0,0,0,0.50000,0.487371,0.110750,3
5464,2023-10-08,983721,534,2.0,3.000000,2.0,2.0,,WR,0,0,1,0,0,0,0,0.50000,0.063846,0.214875,3


## Show descriptive statistics using describe()
Use the ```pandas DataFrame.describe()```
[method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
to generate descriptive statistics. Descriptive statistics include those that
summarize the central tendency, dispersion and shape of a dataset’s
distribution, excluding ```NaN``` values. You may also use other Python methods
to interact with your data.
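
As a small illustration of the `NaN` handling described above, here is a minimal sketch on a hypothetical toy frame. The column names mirror the result set, but the values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names come from the notebook.
toy = pd.DataFrame({
    "avg_sold": [2.0, 92.0, 32.0, 329.0],
    "target_moment_value_next_week": [1.0, 92.0, np.nan, 329.0],
})

summary = toy.describe()

# "count" reports non-null entries only, so the NaN target is not counted,
# and the mean of that column is taken over the three remaining values.
print(summary.loc[["count", "mean", "min", "max"]])
```

A `count` that is lower for one column than for the others is a quick sign that the column contains missing values.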

In [None]:
results.describe()
# Keep the one-hot position flags, the numeric features, and the target column.
df = results[['position_QB', 'position_RB', 'position_WR', 'position_TE', 'position_LB', 'position_DL', 'position_DB', 'rarity', 'final_player_score', 'serial_to_mint_ratio', 'listed_supply', 'avg_sold', 'avg_listed', 'edition_floor_listed', 'moment_value_current_week', 'target_moment_value_next_week']]
# Cast every selected column to float so the dtypes are uniform.
df = df.astype(float)
column_dtypes = df.dtypes
print(column_dtypes)

position_QB                      float64
position_RB                      float64
position_WR                      float64
position_TE                      float64
position_LB                      float64
position_DL                      float64
position_DB                      float64
rarity                           float64
final_player_score               float64
serial_to_mint_ratio             float64
listed_supply                    float64
avg_sold                         float64
avg_listed                       float64
edition_floor_listed             float64
moment_value_current_week        float64
target_moment_value_next_week    float64
dtype: object


In [None]:
import pandas as pd
import numpy as np

numerical_features = ['final_player_score', 'serial_to_mint_ratio', 'listed_supply', 'avg_sold', 'avg_listed', 'moment_value_current_week','rarity']

# Percentile threshold. np.percentile takes q on a 0-100 scale, so 0.75 is the
# 0.75th percentile; use 75 for the 75th percentile.
percentile_threshold = 0.75

# Create a list to store the filtered dataframes for each column
filtered_dataframes = []

# Iterate over every column of df. A separate loop variable keeps the
# numerical_features list above from being overwritten.
for column in df:
    # Calculate the percentile value for the current column
    percentile_value = np.percentile(df[column], percentile_threshold)

    # Keep only the rows at or above the percentile value for the current column
    filtered_data = df[df[column] >= percentile_value]

    # Append the filtered dataframe to the list
    filtered_dataframes.append(filtered_data)

# filtered_dataframes now holds one filtered dataframe per column of df.

# axis=1 places the filtered dataframes side by side, aligned on the row index.
filtered_data_combined = pd.concat(filtered_dataframes, axis=1)

# Keep the first 16 columns, i.e. the dataframe filtered on the first column (position_QB).
filtered_data_combined = filtered_data_combined.iloc[:, :16]
filtered_data_combined.head()

Unnamed: 0,position_QB,position_RB,position_WR,position_TE,position_LB,position_DL,position_DB,rarity,final_player_score,serial_to_mint_ratio,listed_supply,avg_sold,avg_listed,edition_floor_listed,moment_value_current_week,target_moment_value_next_week
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.058292,0.49,86.0,2.0,3.0,1.0,2.0,1.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,0.563732,0.231368,27.0,92.0,211.0,80.0,92.0,92.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.03125,0.039645,0.233528,2.0,32.0,163.0,32.0,32.0,35.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.00304,0.040034,0.79661,8.0,329.0,339.0,329.0,329.0,329.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.2,0.086319,0.0651,141.0,12.0,12.833333,5.0,12.0,12.0
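
For comparison, a percentile filter on a single column might look like the sketch below. It assumes the goal is to keep rows at or above the 75th percentile of one feature; the helper name `filter_at_or_above_percentile` and the toy values are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def filter_at_or_above_percentile(frame, column, q=75):
    """Keep rows whose value in `column` is at or above the q-th percentile.

    q is on a 0-100 scale, as np.nanpercentile expects. NaN values are ignored
    when computing the cutoff and are dropped by the comparison.
    """
    cutoff = np.nanpercentile(frame[column], q)
    return frame[frame[column] >= cutoff]

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({"listed_supply": [2.0, 8.0, 27.0, 86.0, 141.0]})

kept = filter_at_or_above_percentile(toy, "listed_supply", q=75)
print(kept)

# On the same 0-100 scale, q=0.75 is a cutoff just above the minimum,
# so almost every row would pass the filter.
low_cutoff = np.nanpercentile(toy["listed_supply"], 0.75)
print(low_cutoff)
```

`np.nanpercentile` is used here instead of `np.percentile` so that a column with missing values still yields a usable cutoff.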


In [None]:
# fillna returns a new DataFrame; assign the result to keep the imputed values.
filtered_data_combined.fillna(filtered_data_combined.mean())
filtered_data_combined.describe()

Unnamed: 0,position_QB,position_RB,position_WR,position_TE,position_LB,position_DL,position_DB,rarity,final_player_score,serial_to_mint_ratio,listed_supply,avg_sold,avg_listed,edition_floor_listed,moment_value_current_week,target_moment_value_next_week
count,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5466.0,5015.0
mean,0.232345,0.23509,0.402122,0.122576,0.007867,0.0,0.0,0.626961,0.185612,0.443261,38.697585,21.5152,644.847291,17.193377,21.5152,29.48848
std,0.422367,0.424094,0.490371,0.32798,0.088354,0.0,0.0,0.385199,0.207167,0.294918,35.229264,109.267671,19929.08114,91.73115,109.267671,146.12531
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000357,0.004261,0.0001,1.0,1.0,1.5,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.045477,0.184249,13.0,1.0,3.0,1.0,1.0,2.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.081239,0.400464,29.0,2.0,4.0,2.0,2.0,3.0
75%,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.194607,0.693326,52.0,5.0,15.0,3.0,5.0,8.0
max,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.703632,1.0,165.0,3000.0,1000000.0,2800.0,3000.0,3199.0
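
The `count` row above still shows 5015 non-null targets out of 5466 rows. A minimal sketch of one way to handle missing values follows, assuming that feature columns are mean-imputed and that rows without a target are dropped rather than imputed. Both choices are assumptions made for the example, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names come from the notebook.
toy = pd.DataFrame({
    "avg_sold": [2.0, np.nan, 32.0],
    "avg_listed": [3.0, 211.0, np.nan],
    "target_moment_value_next_week": [1.0, 92.0, np.nan],
})
feature_cols = ["avg_sold", "avg_listed"]

# Calling fillna without assigning the result leaves the frame unchanged.
toy.fillna(toy.mean())
still_missing = int(toy.isna().sum().sum())

# Assign the result to keep the imputed feature values.
filled = toy.copy()
filled[feature_cols] = filled[feature_cols].fillna(filled[feature_cols].mean())

# Rows without a target cannot be used as training labels, so drop them.
labeled = filled.dropna(subset=["target_moment_value_next_week"])
print(still_missing, len(labeled))
```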


In [None]:
# Training test split
features = ['position_QB', 'position_RB', 'position_WR', 'position_TE', 'position_LB', 'position_DL', 'position_DB', 'rarity', 'final_player_score',
            'serial_to_mint_ratio', 'listed_supply', 'avg_sold', 'avg_listed', 'moment_value_current_week']

# Use an 80/20 train/test split over the first 5,000 rows; the remaining rows are left out.
moment_train = filtered_data_combined[:4000]
moment_test = filtered_data_combined[4000:5000]

# Create separate variables for features (inputs) and labels (outputs).
# We will be using these in the cells below.
moment_train_features = moment_train[features]
moment_test_features = moment_test[features]
moment_train_labels = moment_train['target_moment_value_next_week']
moment_test_labels = moment_test['target_moment_value_next_week']

# Confirm the data shapes are as expected.
print('train data shape:', moment_train_features.shape)
print('train labels shape:', moment_train_labels.shape)
print('test data shape:', moment_test_features.shape)
print('test labels shape:', moment_test_labels.shape)


train data shape: (4000, 14)
train labels shape: (4000,)
test data shape: (1000, 14)
test labels shape: (1000,)
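
The split above is positional: the first rows go to training and the next block goes to testing. A minimal sketch of the same idea as a reusable helper is shown below; the function name `positional_split` and the toy frame are hypothetical. Because the preview suggests the rows are ordered by `week`, a positional split of this kind would also be roughly chronological, though that ordering is an assumption.

```python
import pandas as pd

def positional_split(frame, n_train, n_test):
    """Return the first n_train rows and the following n_test rows as copies."""
    train = frame.iloc[:n_train].copy()
    test = frame.iloc[n_train:n_train + n_test].copy()
    return train, test

# Hypothetical toy frame standing in for filtered_data_combined.
toy = pd.DataFrame({
    "moment_value_current_week": range(10),
    "target_moment_value_next_week": range(1, 11),
})

train, test = positional_split(toy, n_train=8, n_test=2)
print("train shape:", train.shape)
print("test shape:", test.shape)
```

Returning copies means later column assignments on `train` or `test` do not write into a slice of the original frame.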


In [None]:
from sklearn.preprocessing import MinMaxScaler

# List of numerical features to be scaled
numerical_features = ['final_player_score', 'serial_to_mint_ratio', 'listed_supply', 'avg_sold', 'avg_listed', 'moment_value_current_week', 'rarity']

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler on the training data only, then transform the training data
moment_train_features[numerical_features] = scaler.fit_transform(moment_train_features[numerical_features])

# Transform the test data using the same scaler
moment_test_features[numerical_features] = scaler.transform(moment_test_features[numerical_features])

# Check the scaled data
print('Min-Max scaled train data:')
print(moment_train_features.head())

print('Min-Max scaled test data:')
print(moment_test_features.head())


Min-Max scaled train data:
   position_QB  position_RB  position_WR  position_TE  position_LB  \
0          0.0          0.0          1.0          0.0          0.0   
1          1.0          0.0          0.0          0.0          0.0   
2          0.0          0.0          0.0          1.0          0.0   
3          0.0          0.0          1.0          0.0          0.0   
4          0.0          0.0          1.0          0.0          0.0   

   position_DL  position_DB    rarity  final_player_score  \
0          0.0          0.0  1.000000            0.077258   
1          0.0          0.0  0.012147            0.799962   
2          0.0          0.0  0.030904            0.050594   
3          0.0          0.0  0.002683            0.051150   
4          0.0          0.0  0.199714            0.117332   

   serial_to_mint_ratio  listed_supply  avg_sold  avg_listed  \
0              0.489949       0.518293  0.000333    0.000002   
1              0.231291       0.158537  0.030343    0.000

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  moment_train_features[numerical_features] = scaler.fit_transform(moment_train_features[numerical_features])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  moment_test_features[numerical_features] = scaler.transform(moment_test_features[numerical_features])
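
The warning above is raised because the feature frames are slices of another DataFrame. One way to avoid it, sketched here under stated assumptions, is to take explicit copies before assigning scaled values. To keep the example self-contained, the scaling is written out by hand as `(x - min) / (max - min)`, which is the arithmetic `MinMaxScaler` applies with its default `(0, 1)` range; the toy values are hypothetical.

```python
import pandas as pd

numerical = ["listed_supply", "avg_sold"]

# Hypothetical toy frame; only the column names come from the notebook.
full = pd.DataFrame({
    "position_QB": [0.0, 1.0, 0.0, 0.0],
    "listed_supply": [86.0, 27.0, 2.0, 8.0],
    "avg_sold": [2.0, 92.0, 32.0, 329.0],
})

# .copy() makes each split an independent frame, so the assignments below
# do not write into a slice of `full`.
train = full.iloc[:3].copy()
test = full.iloc[3:].copy()

# Min and range are computed from the training rows only.
col_min = train[numerical].min()
col_range = train[numerical].max() - col_min

# Apply the training statistics to both splits.
train[numerical] = (train[numerical] - col_min) / col_range
test[numerical] = (test[numerical] - col_min) / col_range

print(train)
print(test)
```

Because the statistics come from the training rows alone, test values can fall outside the `[0, 1]` interval, as `avg_sold` does in this toy example.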


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer

# Predict final_player_score from every other column in df,
# including target_moment_value_next_week.
target_variable = df['final_player_score']
features = df.drop('final_player_score', axis=1)

# Split your data into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(features, target_variable, test_size=0.2, random_state=42)

# Impute missing values using mean imputation
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared:", r2)


Mean Squared Error: 0.003927540032908037
R-squared: 0.9109064055618676
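
To make the two reported metrics concrete, the sketch below computes them by hand with NumPy on a few made-up numbers. It assumes the usual definitions for a single target: MSE is the mean of the squared residuals, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `mse_and_r2` is hypothetical.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (mean squared error, R-squared) for 1-D targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Sum of squared prediction errors.
    residual_ss = np.sum((y_true - y_pred) ** 2)
    # Sum of squared deviations of the targets from their mean.
    total_ss = np.sum((y_true - y_true.mean()) ** 2)
    mse = residual_ss / y_true.size
    r2 = 1.0 - residual_ss / total_ss
    return mse, r2

# Hypothetical targets and predictions for illustration only.
mse, r2 = mse_and_r2([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print("Mean Squared Error:", mse)
print("R-squared:", r2)
```

MSE is expressed in squared units of the target, so taking its square root gives an error on the same scale as `final_player_score`; for the value printed above that is roughly 0.063.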
