<a href="https://colab.research.google.com/github/Aditi-dev07/Cisco_DataScience_Projects/blob/main/Data%20Cleaning/Typing%20speeds/typing-speeds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Typing Speeds

How can you improve your typing speed?

The file `typing-speeds.csv` contains typing speed data from more than 168,000 people, each of whom typed 15 sentences. The data was collected through an online typing test published on a free typing speed assessment webpage.

In [1]:
# FOR GOOGLE COLAB ONLY. Skip this cell when running locally.
# Run the code below. A dialog will appear to upload files.
# Upload 'typing-speeds.csv'.

from google.colab import files
uploaded = files.upload()

Saving typing-speeds.csv to typing-speeds.csv


In [2]:
import pandas as pd

df = pd.read_csv('typing-speeds.csv')
df

Unnamed: 0,PARTICIPANT_ID,AGE,HAS_TAKEN_TYPING_COURSE,COUNTRY,LAYOUT,NATIVE_LANGUAGE,FINGERS,KEYBOARD_TYPE,ERROR_RATE,AVG_WPM_15,ROR
0,3,30,0,US,qwerty,en,1-2,full,0.511945,61.9483,0.2288
1,5,27,0,MY,qwerty,en,7-8,laptop,0.871080,72.8871,0.3675
2,7,13,0,AU,qwerty,en,7-8,laptop,6.685633,24.1809,0.0667
3,23,21,0,IN,qwerty,en,3-4,full,2.130493,24.7112,0.0413
4,24,21,0,PH,qwerty,tl,7-8,laptop,1.893287,45.3364,0.2678
...,...,...,...,...,...,...,...,...,...,...,...
168589,517932,20,0,US,qwerty,en,9-10,laptop,8.731466,24.9125,0.1842
168590,517936,25,0,PL,qwerty,pl,9-10,laptop,0.000000,66.2946,0.0639
168591,517943,38,1,US,qwerty,en,9-10,laptop,0.147929,75.6713,0.2021
168592,517944,28,0,GB,qwerty,en,9-10,laptop,0.278552,91.7083,0.5133




| **Variable**             | **Description**                                                                 |
|--------------------------|---------------------------------------------------------------------------------|
| `PARTICIPANT_ID`         | Unique ID of the participant                                                   |
| `AGE`                    | Age of the participant                                                         |
| `HAS_TAKEN_TYPING_COURSE`| Whether the participant has taken a typing course (1 = Yes, 0 = No)            |
| `COUNTRY`                | Country of the participant                                                     |
| `LAYOUT`                 | Keyboard layout used (QWERTY, AZERTY, or QWERTZ)                               |
| `NATIVE_LANGUAGE`        | Native language of the participant                                             |
| `FINGERS`                | Number of fingers used for typing (options: 1-2, 3-4, 5-6, 7-8, 9-10)          |
| `KEYBOARD_TYPE`          | Type of keyboard used (Full/desktop, laptop, small physical, or touch)         |
| `ERROR_RATE`             | Uncorrected error rate (as a percentage)                                       |
| `AVG_WPM_15`             | Words per minute averaged over 15 typed sentences                              |
| `ROR`                    | Rollover ratio                                                                 |


### Project Ideas
- Remove unnecessary columns, such as `PARTICIPANT_ID`, to streamline the dataset.

- Rename columns (e.g., `AVG_WPM_15` to `wpm`, `ROR` to `ror`, `HAS_TAKEN_TYPING_COURSE` to `course`) for brevity and clarity during analysis.

Finger Count Analysis
- Compare typing speeds across groups using different numbers of fingers, excluding the "10+" category for simplicity.

- Control for consistency by first filtering to similar `AGE`, `LAYOUT`, `NATIVE_LANGUAGE`, `KEYBOARD_TYPE`, and `HAS_TAKEN_TYPING_COURSE` values.

- Exclude participants with high error rates (`ERROR_RATE` > 3%) to focus on reliable data.

- Drop columns after filtering if they now only have a single value.

Rollover Ratio Analysis
- The Rollover Ratio (`ROR`) represents the proportion of keypresses where a new key is pressed before releasing the previous one.

- Compare typing speeds between participants with `ROR` ≤ 20% and those with `ROR` > 80%, keeping `AGE`, `KEYBOARD_TYPE`, `FINGERS`, and other variables constant.
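
To make the rollover definition above concrete, here is a minimal sketch that computes a rollover ratio from a hypothetical keystroke log. The `events` list, the `rollover_ratio` helper, and the choice of denominator (every keypress after the first) are assumptions made for illustration; the dataset only provides the final `ROR` value, not the raw key events or the exact formula used.

```python
# Hypothetical keystroke log: (key, press_time_ms, release_time_ms), in press order.
events = [
    ("t", 0, 90),
    ("h", 70, 160),       # pressed before "t" was released: rollover
    ("e", 200, 280),      # pressed after "h" was released: no rollover
    ("space", 260, 340),  # pressed before "e" was released: rollover
]


def rollover_ratio(events):
    """Share of keypresses that begin before the previous key is released."""
    if len(events) < 2:
        return 0.0
    rollovers = sum(
        1
        for (_, _, prev_release), (_, press, _) in zip(events, events[1:])
        if press < prev_release
    )
    # Assumption: the denominator counts every keypress after the first one.
    return rollovers / (len(events) - 1)


print(rollover_ratio(events))
```

In this toy log, two of the three keypresses that follow another key overlap with it, so the ratio is two thirds.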

Influence of Typing Course
- Compare typing speeds between participants with a typing course (`HAS_TAKEN_TYPING_COURSE` = 1) and without (`HAS_TAKEN_TYPING_COURSE` = 0), holding other variables such as `KEYBOARD_TYPE`, `AGE` range, and `FINGERS` constant.


# Cleaning Data


In [4]:
# PARTICIPANT_ID is only an identifier, so it is not needed for the analysis.
df = df.drop('PARTICIPANT_ID', axis=1)

In [7]:
df = df.rename(columns={'AVG_WPM_15': 'wpm', 'ROR': 'ror', 'HAS_TAKEN_TYPING_COURSE': 'course'})
df

Unnamed: 0,AGE,course,COUNTRY,LAYOUT,NATIVE_LANGUAGE,FINGERS,KEYBOARD_TYPE,ERROR_RATE,wpm,ror
0,30,0,US,qwerty,en,1-2,full,0.511945,61.9483,0.2288
1,27,0,MY,qwerty,en,7-8,laptop,0.871080,72.8871,0.3675
2,13,0,AU,qwerty,en,7-8,laptop,6.685633,24.1809,0.0667
3,21,0,IN,qwerty,en,3-4,full,2.130493,24.7112,0.0413
4,21,0,PH,qwerty,tl,7-8,laptop,1.893287,45.3364,0.2678
...,...,...,...,...,...,...,...,...,...,...
168589,20,0,US,qwerty,en,9-10,laptop,8.731466,24.9125,0.1842
168590,25,0,PL,qwerty,pl,9-10,laptop,0.000000,66.2946,0.0639
168591,38,1,US,qwerty,en,9-10,laptop,0.147929,75.6713,0.2021
168592,28,0,GB,qwerty,en,9-10,laptop,0.278552,91.7083,0.5133


# Finger Count Analysis
## 1. Compare typing speeds across groups using different numbers of fingers, excluding the "10+" category for simplicity

In [11]:
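# Mean WPM for each finger-count group.
typing_speed = df.groupby('FINGERS')['wpm'].mean()
# Drop the "10+" group if it is present; errors='ignore' makes this a no-op otherwise.
filtered_typing_speed = typing_speed.drop('10+', errors='ignore')
display(filtered_typing_speed)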

FINGERS,wpm
1-2,40.280812
3-4,41.004952
5-6,45.731789
7-8,50.057909
9-10,57.379572


## 2. Control for consistency by first filtering to similar values

In [25]:
# Controls: age 20-30, QWERTY layout, native English, laptop keyboard, no typing course.
filtered_df = df[(df['AGE'] >= 20) & (df['AGE'] <= 30) &
                 (df['LAYOUT'] == 'qwerty') &
                 (df['NATIVE_LANGUAGE'] == 'en') &
                 (df['KEYBOARD_TYPE'] == 'laptop') &
                 (df['course'] == 0)]

display(filtered_df.head())
print(f"Shape of filtered DataFrame: {filtered_df.shape}")

Unnamed: 0,AGE,course,COUNTRY,LAYOUT,NATIVE_LANGUAGE,FINGERS,KEYBOARD_TYPE,ERROR_RATE,wpm,ror
1,27,0,MY,qwerty,en,7-8,laptop,0.87108,72.8871,0.3675
13,25,0,US,qwerty,en,1-2,laptop,3.183792,28.1308,0.1019
25,21,0,MY,qwerty,en,9-10,laptop,0.421941,38.6345,0.2632
28,20,0,US,qwerty,en,1-2,laptop,2.394366,30.1761,0.3059
32,23,0,IN,qwerty,en,3-4,laptop,0.938967,14.9863,0.0371


Shape of filtered DataFrame: (23360, 10)


## 3. Exclude participants with high error rates (`ERROR_RATE` > 3%) to focus on reliable data.

In [28]:
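# Keep rows with an error rate below 3%.
# Note: the strict "<" also drops rows at exactly 3%; use "<= 3" to exclude only rates above 3%.
exclude_p = filtered_df[filtered_df['ERROR_RATE'] < 3]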
display(exclude_p.head())
print(f"Shape of filtered DataFrame: {exclude_p.shape}")

Unnamed: 0,AGE,course,COUNTRY,LAYOUT,NATIVE_LANGUAGE,FINGERS,KEYBOARD_TYPE,ERROR_RATE,wpm,ror
1,27,0,MY,qwerty,en,7-8,laptop,0.87108,72.8871,0.3675
25,21,0,MY,qwerty,en,9-10,laptop,0.421941,38.6345,0.2632
28,20,0,US,qwerty,en,1-2,laptop,2.394366,30.1761,0.3059
32,23,0,IN,qwerty,en,3-4,laptop,0.938967,14.9863,0.0371
38,21,0,PH,qwerty,en,1-2,laptop,2.047244,20.9079,0.0151


Shape of filtered DataFrame: (21479, 10)


## 4. Drop columns after filtering if they now only have a single value.

In [30]:
# Drop the columns that hold a single unique value after filtering.
exclude_p = exclude_p.loc[:, exclude_p.nunique() != 1]

display(exclude_p.head())
print(f"Shape of filtered DataFrame after dropping single-value columns: {exclude_p.shape}")

Unnamed: 0,AGE,COUNTRY,FINGERS,ERROR_RATE,wpm,ror
1,27,MY,7-8,0.87108,72.8871,0.3675
25,21,MY,9-10,0.421941,38.6345,0.2632
28,20,US,1-2,2.394366,30.1761,0.3059
32,23,IN,3-4,0.938967,14.9863,0.0371
38,21,PH,1-2,2.047244,20.9079,0.0151


Shape of filtered DataFrame after dropping single-value columns: (21479, 6)
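
With the controls applied, the remaining step is the comparison itself: typing speed per `FINGERS` group within the controlled subset. The sketch below chains steps 2 and 3 into one hypothetical helper, `finger_summary`, and reports the group size next to the mean so that small groups are easy to spot. The helper name, its parameters, and the `demo` rows are illustrative assumptions, not part of the notebook. Note that it uses `<=` for the error threshold, which excludes only rates above 3%, as the heading of step 3 describes.

```python
import pandas as pd


def finger_summary(df, age_range=(20, 30), layout="qwerty", language="en",
                   keyboard="laptop", course=0, max_error=3.0):
    """Apply the controls from steps 2-3, then summarise wpm per FINGERS group."""
    subset = df[
        df["AGE"].between(*age_range)          # inclusive on both ends
        & (df["LAYOUT"] == layout)
        & (df["NATIVE_LANGUAGE"] == language)
        & (df["KEYBOARD_TYPE"] == keyboard)
        & (df["course"] == course)
        & (df["ERROR_RATE"] <= max_error)      # keeps rows at exactly max_error
    ]
    return subset.groupby("FINGERS")["wpm"].agg(["count", "mean"])


# Made-up rows for illustration only; not taken from typing-speeds.csv.
demo = pd.DataFrame({
    "AGE": [22, 25, 29, 45, 24],
    "LAYOUT": ["qwerty"] * 5,
    "NATIVE_LANGUAGE": ["en"] * 5,
    "KEYBOARD_TYPE": ["laptop"] * 5,
    "course": [0, 0, 0, 0, 0],
    "FINGERS": ["1-2", "9-10", "9-10", "9-10", "1-2"],
    "ERROR_RATE": [1.0, 0.5, 2.0, 0.1, 7.5],
    "wpm": [30.0, 60.0, 70.0, 90.0, 20.0],
})

result = finger_summary(demo)
print(result)
```

On the real data, the same call would be `finger_summary(df)` after the renaming step, assuming the column names shown in the outputs above.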


# Rollover Ratio Analysis

## 5. Compare typing speeds between participants with `ROR` ≤ 20% and those with `ROR` > 80%, keeping `AGE`, `KEYBOARD_TYPE`, `FINGERS`, and other variables constant.

In [45]:
controlled_df = df.query(
    "AGE >= 20 and AGE <= 30 and "
    "KEYBOARD_TYPE == 'laptop' and "
    "FINGERS == '1-2'"
)


low_ror = controlled_df[controlled_df["ror"] <= 0.20]
high_ror = controlled_df[controlled_df["ror"] > 0.80]
wpm_ror_low = low_ror["wpm"].mean()
wpm_ror_high = high_ror["wpm"].mean()

print(f"WPM for ROR <= 20% (FINGERS='1-2'): {wpm_ror_low:.2f}")
print(f"WPM for ROR > 80% (FINGERS='1-2'): {wpm_ror_high:.2f}")

WPM for ROR <= 20% (FINGERS='1-2'): 32.12
WPM for ROR > 80% (FINGERS='1-2'): nan
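
The `nan` in the second line is what pandas returns for the mean of an empty selection, which suggests that no participant in this controlled subset has `ror` above 0.80 (the output alone does not show the group sizes, so this is an inference). A minimal sketch of how to check group sizes alongside the means follows; the `compare_ror` helper and the `demo` rows are hypothetical and exist only to reproduce the situation on a small scale.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from typing-speeds.csv.
demo = pd.DataFrame({
    "FINGERS": ["1-2", "1-2", "9-10", "9-10"],
    "ror": [0.05, 0.15, 0.10, 0.85],
    "wpm": [28.0, 36.0, 50.0, 110.0],
})


def compare_ror(df):
    """Return (row count, mean wpm) for the low- and high-rollover groups."""
    groups = {
        "low (<= 0.20)": df[df["ror"] <= 0.20],
        "high (> 0.80)": df[df["ror"] > 0.80],
    }
    return {name: (len(g), g["wpm"].mean()) for name, g in groups.items()}


# Restricting to two-finger typists leaves the high-rollover group empty,
# so its mean is nan; without that restriction the group has one row.
two_finger = compare_ror(demo[demo["FINGERS"] == "1-2"])
all_fingers = compare_ror(demo)
print(two_finger)
print(all_fingers)
```

If the high-rollover group is empty on the real data as well, one option is to repeat the comparison for a different `FINGERS` value, since the filter to `'1-2'` may be what removes every high-rollover typist.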


# Influence of Typing Course

## 6. Compare typing speeds between participants with a typing course (`HAS_TAKEN_TYPING_COURSE` = 1) and without (`HAS_TAKEN_TYPING_COURSE` = 0), holding other variables such as `KEYBOARD_TYPE`, `AGE` range, and `FINGERS` constant.

In [38]:
# Controls shared by both groups: age 20-30, QWERTY layout, native English, laptop keyboard.
controls = ((df['AGE'] >= 20) & (df['AGE'] <= 30) &
            (df['LAYOUT'] == 'qwerty') &
            (df['NATIVE_LANGUAGE'] == 'en') &
            (df['KEYBOARD_TYPE'] == 'laptop'))

# Participants who have taken a typing course.
taken_course = df[controls & (df['course'] == 1)]
print(f"Shape of taken_course DataFrame: {taken_course.shape}")
display(taken_course.head())

# Participants who have not taken a typing course.
not_taken_course = df[controls & (df['course'] == 0)]
print(f"Shape of not_taken_course DataFrame: {not_taken_course.shape}")
display(not_taken_course.head())

Shape of taken_course DataFrame: (8402, 10)


Unnamed: 0,AGE,course,COUNTRY,LAYOUT,NATIVE_LANGUAGE,FINGERS,KEYBOARD_TYPE,ERROR_RATE,wpm,ror
6,20,1,AF,qwerty,en,7-8,laptop,3.127715,9.9978,0.0049
23,25,1,US,qwerty,en,3-4,laptop,1.30597,14.6155,0.0446
33,22,1,US,qwerty,en,9-10,laptop,0.147059,63.8758,0.3232
50,21,1,PH,qwerty,en,5-6,laptop,0.34904,36.4503,0.2628
108,22,1,IN,qwerty,en,9-10,laptop,1.176471,48.9956,0.447


Shape of not_taken_course DataFrame: (23360, 10)


Unnamed: 0,AGE,course,COUNTRY,LAYOUT,NATIVE_LANGUAGE,FINGERS,KEYBOARD_TYPE,ERROR_RATE,wpm,ror
1,27,0,MY,qwerty,en,7-8,laptop,0.87108,72.8871,0.3675
13,25,0,US,qwerty,en,1-2,laptop,3.183792,28.1308,0.1019
25,21,0,MY,qwerty,en,9-10,laptop,0.421941,38.6345,0.2632
28,20,0,US,qwerty,en,1-2,laptop,2.394366,30.1761,0.3059
32,23,0,IN,qwerty,en,3-4,laptop,0.938967,14.9863,0.0371
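
The two subsets above still mix all finger counts, and the speeds themselves have not been compared yet. One possible way to finish the comparison, sketched here on made-up rows, is to group by both `FINGERS` and `course` so that each finger-count group gets its own with-course and without-course mean. The `demo` frame and the `difference` column are illustrative assumptions; on the real data the same `groupby` could be run on the rows selected by the controls above.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from typing-speeds.csv.
demo = pd.DataFrame({
    "course": [0, 0, 1, 1, 0, 1],
    "FINGERS": ["1-2", "1-2", "1-2", "9-10", "9-10", "9-10"],
    "wpm": [30.0, 34.0, 38.0, 70.0, 60.0, 74.0],
})

# Mean wpm per (FINGERS, course) pair, with one column per course value.
by_course = (
    demo.groupby(["FINGERS", "course"])["wpm"]
    .mean()
    .unstack("course")
)

# Positive values mean the course group typed faster within that finger group.
by_course["difference"] = by_course[1] - by_course[0]
print(by_course)
```

Comparing within each `FINGERS` group keeps finger count constant, as the heading asks, instead of letting a different mix of typing styles in the two groups drive the result.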
