## Predict Student Performance from Game Play

Online educational games can collect a wide variety of data about student performance, such as time spent playing, progress through game levels, scores, and quiz results. This data can be used to identify patterns and predict future performance. For example, if a student has consistently performed well on certain types of activities in the game, it's likely that they will continue to do so in the future.

By analyzing this data, educators can also track how student learning progresses over time. This information can be used to identify areas where students are struggling and provide additional support or resources. Similarly, it can be used to identify areas where students are excelling and to provide additional challenges or opportunities for growth.

Overall, data from online educational games can provide valuable insights into student performance and learning. By using this data effectively, educators can provide targeted and personalized support to help students succeed.

# Data

In [None]:
import pandas as pd 
import numpy as np 

In [None]:
sample = pd.read_csv("/kaggle/input/gameplay/sample_submission.csv")


In [None]:
sample.head()

In [None]:
train = pd.read_csv("/kaggle/input/train-for-predict-gameplay/train.csv")
test = pd.read_csv("/kaggle/input/gameplay/test.csv")


In [None]:
train_label = pd.read_csv("/kaggle/input/gameplay/train_labels.csv")

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train_label.head()

In [None]:
test['index'].unique()

In [None]:
print(train.shape)
print(test.shape)


# Review data

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

In [None]:
train['event_name'].unique()

In [None]:
train['event_name'].describe()

In [None]:
train['level_group'].unique()

In [None]:
train['level_group'].describe()

- There are two main types of event: automatic navigation and manual clicks.
- The game runs from level 0 to level 22, and the levels are grouped into stages such as 0-4 and 13-22.

In [None]:
train['session_id'].describe()

## Check useful columns and missing values

In [None]:
import missingno as msno

In [None]:
msno.bar(train)

In [None]:
msno.bar(test)

- The `fullscreen`, `hq`, and `music` columns contain no values at all, so we drop them: an empty column cannot tell us anything about the data.
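The empty columns can also be found programmatically instead of being read off the chart. The following is a minimal sketch on a toy DataFrame, not the real competition data; the frame `toy` and its values are made up for illustration, and only the column names mirror the ones mentioned above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the game log (hypothetical values).
toy = pd.DataFrame({
    "session_id": [1, 1, 2],
    "elapsed_time": [0, 1200, 300],
    "fullscreen": [np.nan, np.nan, np.nan],
    "hq": [np.nan, np.nan, np.nan],
    "music": [np.nan, np.nan, np.nan],
})

# Columns in which every single value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop them; toy.dropna(axis=1, how="all") would give the same result.
cleaned = toy.drop(columns=empty_cols)

print(empty_cols)
print(cleaned.columns.tolist())
```

Applied to `train` and `test`, the same two lines would avoid typing the list of kept columns by hand.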

In [None]:
test.columns

In [None]:
test = test[['session_id', 'index', 'elapsed_time', 'event_name', 'name', 'level',
       'page', 'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y',
       'hover_duration', 'text', 'fqid', 'room_fqid', 'text_fqid','level_group', 'session_level']]

In [None]:
msno.bar(test)

In [None]:
msno.bar(train)

In [None]:
train.columns

In [None]:
train = train[['session_id', 'index', 'elapsed_time', 'event_name', 'name', 'level',
       'page', 'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y',
       'hover_duration', 'text', 'fqid', 'room_fqid', 'text_fqid','level_group']]

In [None]:
msno.bar(train)

# Preprocessing

In [None]:
# preprocessing
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# encoding
from sklearn.compose import ColumnTransformer


In [None]:
test.columns

In [None]:
test['index'].unique()

In [None]:
onehot = OneHotEncoder()
onehot_var = ['level']

In [None]:
transformer = ColumnTransformer([('one hot', onehot, onehot_var)], remainder='passthrough')

In [None]:
transformer

In [None]:
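# One-hot encode `level`; every other column passes through unchanged.
test_encoded = pd.DataFrame(transformer.fit_transform(test))
# get_feature_names() was removed in scikit-learn 1.2; newer versions offer
# get_feature_names_out(), which names the columns differently.
test_encoded.columns = transformer.get_feature_names()
test_encoded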

- Student visualizations

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
test_encoded.columns

In [None]:
test.columns

In [None]:
train_label.columns

In [None]:
fig,ax =plt.subplots (figsize = (15,15))
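# Correlate numeric columns only; recent pandas versions raise on text columns.
sns.heatmap(train.select_dtypes(include='number').corr(), annot=True, cmap='YlGnBu', linewidths=5, ax=ax)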

In [None]:
plt.figure(figsize=(5,5))
plt.title('Correct_rate')
(train_label['correct'].value_counts(normalize=True)*100).plot.pie(labels = ["correct", "false"], autopct='%1.1f%%')

In [None]:
sns.barplot(x='page',y='index',data=train)

In [None]:
train.columns

In [None]:
sns.barplot (x='name',y='index',data=train)

In [None]:
sns.barplot(x='name',y='index',data=test_encoded)

In [None]:
sns.barplot(x='index',y='event_name',data=test)

In [None]:
sns.barplot(x='level',y='elapsed_time',data=test)


- The chart shows that players complete level 1 in a short time and take longer to complete level 19.
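By default `sns.barplot` draws the mean of `y` for each `x` value, so the bar heights can be checked with a plain `groupby`. The sketch below uses a made-up frame `toy` with hypothetical timings, not the real test set.

```python
import pandas as pd

# Hypothetical events: two at level 1 and two at level 19.
toy = pd.DataFrame({
    "level": [1, 1, 19, 19],
    "elapsed_time": [1000, 3000, 900000, 1100000],
})

# Mean elapsed_time per level, which is what the bar heights represent.
mean_time = toy.groupby("level")["elapsed_time"].mean()
print(mean_time)
```

If `elapsed_time` counts from the start of the session rather than from the start of the level, later levels will show larger values for that reason alone, so the per-level means are best read with that in mind.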

In [None]:
score= train_label['correct'].sort_values( ascending=True)

In [None]:
fig = make_subplots(rows=1, cols=1)
fig.append_trace(go.Bar(
    x=test["level"],
    y=test["index"],
    name="trial and error",
), row=1, col=1)
fig.update_layout(height=600, width=900, title_text="Trial and error at every level")
fig.update_layout(
    paper_bgcolor="#f2f2f2",
    plot_bgcolor="yellowgreen",
    font=dict(family='sans-serif', color='black', size=10),
)
fig.show()

- The chart suggests that level 4 is the easiest: many players get through it without trial and error. Level 18, on the other hand, looks like the hardest, because many players needed repeated attempts to complete it.

In [None]:
test_encoded.columns

In [None]:
# Map the encoder's column names (one per level, 0-22) to readable labels.
# rename() returns a new DataFrame; assign the result to keep the new names.
test_encoded.rename(columns={f"one hot__x0_{i}": f"level {i}" for i in range(23)})

In [None]:
train_label

In [None]:
train_label['question'] = train_label['session_id'].apply(lambda x: int(x.split('_')[1][1:]))


In [None]:
train_label

In [None]:
# Share of correct answers over all rows (no hard-coded row count needed).
mean_correct = train_label['correct'].mean()

In [None]:
mean_correct

- On average, about 70% of the answers were correct.

In [None]:
sns.barplot(x='question',y='correct', data=train_label)

- The chart shows that question 13 was difficult for students: only about 20% answered it correctly. If question 13 has to stay in the game, one option is to swap it with question 18 so that the hardest question comes at the end.
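The per-question rates behind the bar chart can also be computed directly. This is a minimal sketch on a toy label table; the session ids and answers in `toy_labels` are made up, and the only thing taken from the notebook is the `<session>_q<number>` id format.

```python
import pandas as pd

# Hypothetical labels in the same "<session>_q<number>" format.
toy_labels = pd.DataFrame({
    "session_id": ["100_q1", "101_q1", "100_q13", "101_q13", "102_q13"],
    "correct": [1, 1, 0, 0, 1],
})

# Extract the question number from the id.
toy_labels["question"] = (
    toy_labels["session_id"].str.split("_q").str[1].astype(int)
)

# Share of correct answers per question, hardest question first.
rate = toy_labels.groupby("question")["correct"].mean().sort_values()
hardest = rate.index[0]
print(rate)
print(hardest)
```

Sorting the rates makes the hardest and easiest questions easy to read off without inspecting the chart.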

In [None]:
train['text']

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
word_sample = ['fun learn fun ' , 'car do all day' , 'hate feel ']
vect = CountVectorizer()
vect.fit(word_sample)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def kamus(check):
    """Return a word-frequency table (kamus = dictionary) for a text Series."""
    # Pull out every alphabetic word; rows with missing text yield no matches.
    check = check.str.extractall('([a-zA-Z]+)')
    check.columns = ['check']
    b = check.reset_index(drop=True)
    check = b['check'].value_counts()

    kamus = {'word': check.index, 'freq': check.values}
    kamus = pd.DataFrame(kamus)
    kamus.index = kamus['word']
    kamus.drop('word', axis=1, inplace=True)
    kamus.sort_values('freq', ascending=False, inplace=True)

    return kamus

In [None]:
train['text']

In [None]:
df_kamus = kamus(test['text'])

In [None]:
df_kamus[:20].plot(kind='barh')

# Frequency of the words in the students' text in the test sample
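The same kind of count can be reproduced with the standard library alone, which is handy for checking the table above on a few lines of text. This is a sketch on made-up sentences; `lines` is hypothetical, and lower-casing the text is an added assumption that `kamus` itself does not make.

```python
import re
from collections import Counter

# Hypothetical text entries; None stands in for a missing value.
lines = ["Fun to learn, fun to play!", None, "Learn all day"]

counts = Counter()
for line in lines:
    if line is None:
        continue  # skip missing text
    # Lower-case first so "Fun" and "fun" are counted together.
    counts.update(re.findall(r"[a-zA-Z]+", line.lower()))

print(counts.most_common(3))
```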

In [None]:


def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(20, 10))
    # Display image
    plt.imshow(wordcloud) 
    # No axis details
    plt.axis("off")

In [None]:
from wordcloud import WordCloud

In [None]:
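# Join the actual text values; str(Series) would only give a truncated preview.
word_cloud = WordCloud().generate(" ".join(test['text'].dropna().astype(str)))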
plot_cloud(word_cloud)

- In the image, the larger a word appears, the more often it occurs in the students' text in the test sample.