# Week 6 Instructor-led Lab - Manipulating Data

**Intro to Python**  
**Manipulating Data**  
**Cody Thompson**  
**Date:** 4/14/2025

Welcome to my notebook for the Manipulating Data lab! In this notebook, we work with the `github_teams.csv` dataset: accessing values by position and by label, filtering rows on conditions, drawing random samples, and building new DataFrames from selected columns.


In [2]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

# Set Working Directory
os.chdir('C:\\Users\\cthom\\Downloads\\BGEN 632 Intro to Python\\GitHub_Repos\\Week 7\\week7labs\\data')

# Verify the current working directory
print("Current Working Directory:", os.getcwd())


Current Working Directory: C:\Users\cthom\Downloads\BGEN 632 Intro to Python\GitHub_Repos\Week 7\week7labs\data


# Accessing Data

In [3]:
#~~~~~~~~ Accessing Data ~~~~~~~

# Load the dataset into a DataFrame
df = pd.read_csv('github_teams.csv')

In [4]:
# Display the column names
print("Column Headers:", df.columns)

Column Headers: Index(['name_h', 'Team_type', 'Team_size_class', 'human_members_count',
       'bot_members_count', 'human_work', 'work_per_human', 'human_gini',
       'human_Push', 'human_IssueComments', 'human_PRReviewComment',
       'human_MergedPR', 'bot_work', 'bot_Push', 'bot_IssueComments',
       'bot_PRReviewComment', 'bot_MergedPR', 'eval_survival_day_median',
       'issues_count'],
      dtype='object')


In [5]:
# Get the number of columns in the DataFrame
num_columns = len(df.columns)
print("Number of Columns:", num_columns)

Number of Columns: 19


In [6]:
# Get the number of rows in the DataFrame
num_rows = len(df)
print("Number of Rows:", num_rows)

Number of Rows: 608


In [7]:
# Inspect the data types to determine which columns are categorical (stored as *object*)
print("\nData Types of Columns:")
print(df.dtypes)


Data Types of Columns:
name_h                       object
Team_type                    object
Team_size_class              object
human_members_count           int64
bot_members_count             int64
human_work                    int64
work_per_human              float64
human_gini                  float64
human_Push                    int64
human_IssueComments           int64
human_PRReviewComment         int64
human_MergedPR                int64
bot_work                      int64
bot_Push                      int64
bot_IssueComments             int64
bot_PRReviewComment           int64
bot_MergedPR                  int64
eval_survival_day_median    float64
issues_count                float64
dtype: object


In [8]:
# Convert columns with object type to category (categorical columns)
categorical_columns = df.select_dtypes(include=['object']).columns
df[categorical_columns] = df[categorical_columns].astype('category')
print("\nCategorical Columns converted to category:")
print(df.dtypes)


Categorical Columns converted to category:
name_h                      category
Team_type                   category
Team_size_class             category
human_members_count            int64
bot_members_count              int64
human_work                     int64
work_per_human               float64
human_gini                   float64
human_Push                     int64
human_IssueComments            int64
human_PRReviewComment          int64
human_MergedPR                 int64
bot_work                       int64
bot_Push                       int64
bot_IssueComments              int64
bot_PRReviewComment            int64
bot_MergedPR                   int64
eval_survival_day_median     float64
issues_count                 float64
dtype: object
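
The conversion above turns every *object* column into *category*, including the identifier column `name_h`. As a rough illustration of when the conversion pays off, here is a minimal sketch on a made-up toy frame (the `team_id` column and all values are hypothetical, not taken from `github_teams.csv`). It compares the memory used by a low-cardinality column before and after conversion.

```python
import pandas as pd

# Toy data (hypothetical values, not from github_teams.csv)
toy = pd.DataFrame({
    "team_id": [f"team-{i}" for i in range(1000)],   # every value is unique
    "Team_type": ["human", "human-bot"] * 500,       # only two distinct values
})

# Memory per column before converting anything
before = toy.memory_usage(deep=True)

# Convert only the low-cardinality column to category
converted = toy.astype({"Team_type": "category"})
after = converted.memory_usage(deep=True)

print(converted.dtypes)
print("Team_type bytes before:", before["Team_type"], "after:", after["Team_type"])
print("Categories:", list(converted["Team_type"].cat.categories))
```

A category column stores each distinct value once plus a small integer code per row, so the saving is largest when a column has few distinct values relative to its length.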


In [9]:
# How many unique values does 'Team_type' have?
# Count unique values in 'Team_type' column
team_type_unique = df['Team_type'].nunique()
print("\nUnique values in 'Team_type':", team_type_unique)


Unique values in 'Team_type': 2


In [10]:
# How many unique values does 'Team_size_class' have?
# Count unique values in 'Team_size_class' column
team_size_class_unique = df['Team_size_class'].nunique()
print("\nUnique values in 'Team_size_class':", team_size_class_unique)


Unique values in 'Team_size_class': 3


In [11]:
# What is the value of the 63rd row and 6th column?
# Access the value in the 63rd row and 6th column
value_63_6 = df.iloc[62, 5]  # 0-indexed, so row 63 is index 62, column 6 is index 5
print("\nValue of the 63rd row and 6th column:", value_63_6)


Value of the 63rd row and 6th column: 35


In [12]:
# What are the values for the 300th row?
# Access the 300th row
values_300 = df.iloc[299]  # 0-indexed, so row 300 is index 299
print("\nValues for the 300th row:")
print(values_300)


Values for the 300th row:
name_h                      IyfocAGfAHLncCVJUujqTA/A_QZ6HlUb5sRQHhPa7SGzQ
Team_type                                                       human-bot
Team_size_class                                                    Medium
human_members_count                                                     4
bot_members_count                                                       1
human_work                                                           1049
work_per_human                                                     262.25
human_gini                                                       0.448761
human_Push                                                            739
human_IssueComments                                                   213
human_PRReviewComment                                                  91
human_MergedPR                                                          6
bot_work                                                               52
bot_Push   

In [13]:
# Select row with index value 595 with 1st, 2nd, 3rd columns using three methods
# Method 1: Using .iloc
row_595_1 = df.iloc[594, [0, 1, 2]]  # 0-indexed, so row 595 is index 594
print("\nRow 595 with 1st, 2nd, 3rd columns (Method 1):")
print(row_595_1)


Row 595 with 1st, 2nd, 3rd columns (Method 1):
name_h             Z7pZbUnKgDYaYhaLIGmLpw/O92cd-KLiFkunriLmVErdw
Team_type                                              human-bot
Team_size_class                                            Small
Name: 594, dtype: object


In [14]:
# Method 2: Using .loc (label-based; with the default RangeIndex, label 594 is the 595th row, the same row as Method 1)
row_595_2 = df.loc[594, ['name_h', 'Team_type', 'Team_size_class']]
print("\nRow 595 with 1st, 2nd, 3rd columns (Method 2):")
print(row_595_2)


Row 595 with 1st, 2nd, 3rd columns (Method 2):
name_h             Z7pZbUnKgDYaYhaLIGmLpw/O92cd-KLiFkunriLmVErdw
Team_type                                              human-bot
Team_size_class                                            Small
Name: 594, dtype: object


In [15]:
# Method 3: Using direct column indexing
row_595_3 = df[['name_h', 'Team_type', 'Team_size_class']].iloc[594]
print("\nRow 595 with 1st, 2nd, 3rd columns (Method 3):")
print(row_595_3)


Row 595 with 1st, 2nd, 3rd columns (Method 3):
name_h             Z7pZbUnKgDYaYhaLIGmLpw/O92cd-KLiFkunriLmVErdw
Team_type                                              human-bot
Team_size_class                                            Small
Name: 594, dtype: object


In [16]:
# Select the row with index value 46 with the 3rd and 7th columns using two methods
# Method 1: Using .iloc
row_46_1 = df.iloc[45, [2, 6]]  # 0-indexed, so row 46 is index 45, column 3 is index 2, column 7 is index 6
print("\nRow 46 with 3rd and 7th columns (Method 1):")
print(row_46_1)



Row 46 with 3rd and 7th columns (Method 1):
Team_size_class    Small
work_per_human      14.0
Name: 45, dtype: object


In [17]:
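# Method 2: Using .loc (label-based; the 3rd and 7th columns are 'Team_size_class' and 'work_per_human')
row_46_2 = df.loc[45, ['Team_size_class', 'work_per_human']]  # with the default RangeIndex, label 45 is the 46th row
print("\nRow 46 with 3rd and 7th columns (Method 2):")
print(row_46_2)


Row 46 with 3rd and 7th columns (Method 2):
Team_size_class    Small
work_per_human      14.0
Name: 45, dtype: object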
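
The cells above mix `.iloc` (position-based) and `.loc` (label-based). With the default index the two only differ by how you count, which makes them easy to confuse. The sketch below uses a made-up toy frame whose index labels start at 100 (all values hypothetical), so positions and labels clearly diverge.

```python
import pandas as pd

# Toy frame (hypothetical values); index labels start at 100, so label != position
toy = pd.DataFrame(
    {"Team_type": ["human", "human-bot", "human"], "human_work": [10, 20, 30]},
    index=[100, 101, 102],
)

by_position = toy.iloc[1]     # second row, regardless of its label
by_label = toy.loc[101]       # row whose index label is 101

# Row and column selection together: positions with .iloc, names with .loc
mixed = toy.iloc[1, [0, 1]]
named = toy.loc[101, ["Team_type", "human_work"]]

print(by_position)
print(by_label)
print(mixed)
print(named)
```

All four selections return the same row here. On `df`, which has the default 0-based index, position and label coincide, so `df.iloc[n]` and `df.loc[n]` return the same row for the same `n`.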


# Sorting and Ordering Data

In [18]:
#~~~~~~~~~~~ Sorting and Ordering data  ~~~~~~~~~~~~~

# Select 'human-bot' teams with 'bot_members_count' >= 2
human_bot_teams = df[(df['Team_type'] == 'human-bot') & (df['bot_members_count'] >= 2)]
print("\n'human-bot' teams with 'bot_members_count' >= 2:")
print(human_bot_teams)


'human-bot' teams with 'bot_members_count' >= 2:
                                            name_h  Team_type Team_size_class  \
3    _l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA  human-bot           Large   
4    _l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g  human-bot           Large   
42   2-scMrZv13F95YPZmfieww/4Zc56iUYjIZrZU06omFrJw  human-bot           Large   
84   4YoH8row044yJjPIqWJw9Q/NSXj3i61X71lV0StTN71Ww  human-bot           Small   
89   5Is-_ie16OEGmW1arZm8qg/8UeSk2P76pTG7pPLtxsHTQ  human-bot           Large   
110  7sA-8-nyqr0Ri2CT4-FSZw/GJPQoUhHfvUsxKcdkHWLEw  human-bot           Small   
146  bi5TY2Z4OSQq3PMs6JnKYA/5wtZcUUo1XmLHIra8NDtFQ  human-bot          Medium   
147  bi5TY2Z4OSQq3PMs6JnKYA/9b9IqkDK14ketwn88f3hKA  human-bot           Small   
149  bi5TY2Z4OSQq3PMs6JnKYA/kIiAIJpk6lOa6Nxf234KkQ  human-bot           Small   
224  FAhkB4rsocfDW0vrM8U8NA/3KHgTzOwWtAxTXlp_mbqoA  human-bot           Large   
229  FAhkB4rsocfDW0vrM8U8NA/Tl_ZLGwQZrAi-GHyEKl_jA  human-b

In [21]:
# Find 'human' teams that are 'Large' and have 'human_gini' >= 0.75
human_large_teams = df[(df['Team_type'] == 'human') & (df['Team_size_class'] == 'Large') & (df['human_gini'] >= 0.75)]
print("\n'human' teams that are 'Large' and have 'human_gini' >= 0.75:")
print(human_large_teams)


'human' teams that are 'Large' and have 'human_gini' >= 0.75:
                                            name_h Team_type Team_size_class  \
138  ASYGR96YA91p3z7MNKjZCA/IB2pZ8ygcvNnlxUdysjSFA     human           Large   
285  IiUao8vA_zm_uEIVVLI-Sw/91ya8vlSP8qgwCllH_6BSw     human           Large   
505  uLHPO58cQefwrJUbyhYOKQ/7YWOP8uDEeKDHQMWKqOoYA     human           Large   
582  y8Jw59EHVSrsluSuhR5okg/V5vb074jNkzg4YCKforX1Q     human           Large   

     human_members_count  bot_members_count  human_work  work_per_human  \
138                   12                  0        1655      137.916667   
285                   25                  0        3599      143.960000   
505                   48                  0        5748      119.750000   
582                    8                  0         277       34.625000   

     human_gini  human_Push  human_IssueComments  human_PRReviewComment  \
138    0.799446         793                  684                    178   
285    0.8

In [22]:
# Count teams in the 'Small' or 'Large' category
small_large_teams_count = df[df['Team_size_class'].isin(['Small', 'Large'])].shape[0]
print("\nNumber of teams in the 'Small' or 'Large' category:", small_large_teams_count)


Number of teams in the 'Small' or 'Large' category: 428


In [23]:
# Count teams in the 'Small' or 'Large' category with 'human_gini' <= 0.20
small_large_gini_count = df[(df['Team_size_class'].isin(['Small', 'Large'])) & (df['human_gini'] <= 0.20)].shape[0]
print("\nNumber of teams in the 'Small' or 'Large' category with 'human_gini' <= 0.20:", small_large_gini_count)


Number of teams in the 'Small' or 'Large' category with 'human_gini' <= 0.20: 66


In [24]:
# Count 'human-bot' teams in the 'Medium' category
human_bot_medium_teams_count = df[(df['Team_type'] == 'human-bot') & (df['Team_size_class'] == 'Medium')].shape[0]
print("\nNumber of 'human-bot' teams in the 'Medium' category:", human_bot_medium_teams_count)


Number of 'human-bot' teams in the 'Medium' category: 84
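
The counts above filter the DataFrame and then read `.shape[0]`. An equivalent approach is to sum the boolean mask directly. This is a minimal sketch on a made-up toy frame (values hypothetical), reusing the column names from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not from github_teams.csv)
toy = pd.DataFrame({
    "Team_type": ["human", "human-bot", "human-bot", "human"],
    "Team_size_class": ["Small", "Medium", "Large", "Medium"],
    "human_gini": [0.10, 0.50, 0.15, 0.80],
})

# Each condition is a boolean Series; & combines them row by row
is_small_or_large = toy["Team_size_class"].isin(["Small", "Large"])
low_gini = toy["human_gini"] <= 0.20

# Two ways to count matching rows
count_via_shape = toy[is_small_or_large & low_gini].shape[0]
count_via_sum = int((is_small_or_large & low_gini).sum())

# value_counts reports the count for every category at once
size_counts = toy["Team_size_class"].value_counts()

print(count_via_shape, count_via_sum)
print(size_counts)
```

Naming the masks first keeps long conditions readable, and summing the mask avoids building an intermediate filtered DataFrame.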


In [25]:
# Create a subsample of 50% of your data
subsample_50 = df.sample(frac=0.5, random_state=42)
print("\n50% subsample of data:")
print(subsample_50.head())


50% subsample of data:
                                            name_h  Team_type Team_size_class  \
109  7R4RlGuxT82BQcun7raI2A/qMXTUX8UDZFqbyxir5YUaA      human           Large   
10   -2eEmMGH_9GMVVn0WImTyA/fqB6po6-zxOceVyOZhgyNg  human-bot           Large   
184  DevSiHbJAM-XpY7qknTWIA/ZQC28gcQYc30ailijMZxIg      human          Medium   
77   4bzOzx-_iEqw2tTZAGgCBQ/o5j7d9925hU99XzWX9CnxQ      human           Small   
538  vpAJthlySeoTSTCzS0iH9w/WC1ULBuXAws4mFhHGkKRrg  human-bot           Small   

     human_members_count  bot_members_count  human_work  work_per_human  \
109                    9                  0         344       38.222222   
10                    16                  1        5105      319.062500   
184                    5                  0         367       73.400000   
77                     2                  0          93       46.500000   
538                    3                  1         303      101.000000   

     human_gini  human_Push  human_Iss
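
A 50% subsample is often paired with the rows that were not drawn, for example as a train/test split. The sketch below shows one way to get that complement. It uses a small made-up frame as a stand-in for `df` (values hypothetical).

```python
import pandas as pd

# Toy stand-in for df (hypothetical values)
toy = pd.DataFrame({"team": range(10), "human_work": range(0, 100, 10)})

# Reproducible 50% draw without replacement
half = toy.sample(frac=0.5, random_state=42)

# The complement: every row whose index label was not drawn
other_half = toy.drop(half.index)

print(len(half), len(other_half))
print(sorted(half.index.union(other_half.index)) == list(toy.index))
```

Because `sample` keeps the original index labels, `drop(half.index)` removes exactly the sampled rows and leaves the rest.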

In [26]:
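# Create samples for an 8-fold cross-validation test
# Each sample is an independent random draw of 1/8 of the rows, so samples can overlap
# and together need not cover every row (unlike a true partition into folds)
cross_val_samples = [df.sample(frac=1/8, random_state=i) for i in range(8)]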
print("\n8-fold cross-validation samples:")
for i, sample in enumerate(cross_val_samples):
    print(f"\nFold {i+1}:")
    print(sample.head())


8-fold cross-validation samples:

Fold 1:
                                            name_h  Team_type Team_size_class  \
576  xWubIgj2VILg4snlrzsM4w/UalqK3B7G5LNe6qRInBKHw      human           Small   
52   35V6uT_BzwKvPZ1wkdVc9g/GGqiclKUTLWu2NNVmV5H3g  human-bot           Small   
531  vihAQ-l2aJ5v4UYrXYRGiw/DEmu3nlEY-yVVBpaZ4dG_Q      human           Large   
345  MC6oqT7o22Y_rULWJZllfA/MXyVzmYYom7cgybNB0CjFQ  human-bot           Large   
55   3kxNLliWxuuskyxXPMhm9w/cUJHhJ8OwSG-iDYoeBPbFw  human-bot           Small   

     human_members_count  bot_members_count  human_work  work_per_human  \
576                    3                  0          46       15.333333   
52                     2                  1         118       59.000000   
531                    7                  0          91       13.000000   
345                    7                  2        1421      203.000000   
55                     2                  1         187       93.500000   

     human_gini  hu
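
In k-fold cross-validation the folds normally partition the data: each row lands in exactly one fold. scikit-learn's `KFold`, which the first cell already imports, is the usual tool for this. The sketch below uses only NumPy so the mechanics stay visible, and it runs on a small made-up frame standing in for `df` (the `toy` frame and the seed are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (hypothetical values)
toy = pd.DataFrame({"team": range(20)})

# Shuffle the row positions once, then cut them into 8 nearly equal chunks
rng = np.random.default_rng(42)
shuffled = rng.permutation(len(toy))
fold_positions = np.array_split(shuffled, 8)

# Each fold is a disjoint slice of the data
folds = [toy.iloc[pos] for pos in fold_positions]

for i, test_fold in enumerate(folds, start=1):
    train = toy.drop(test_fold.index)   # all rows outside the current fold
    print(f"Fold {i}: test={len(test_fold)} rows, train={len(train)} rows")
```

Shuffling once and splitting guarantees the folds are disjoint and together cover every row, which independent `sample` calls with different seeds do not.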

In [27]:
# Select columns that are numeric and save as a new DataFrame
numeric_columns_df = df.select_dtypes(include=['number'])
print("\nNumeric columns DataFrame:")
print(numeric_columns_df.head())


Numeric columns DataFrame:
   human_members_count  bot_members_count  human_work  work_per_human  \
0                    2                  1          66       33.000000   
1                    2                  0          62       31.000000   
2                    7                  0         211       30.142857   
3                  234                 12       14579       62.303419   
4                   38                  8        1625       42.763158   

   human_gini  human_Push  human_IssueComments  human_PRReviewComment  \
0    0.287879          29                   33                      4   
1    0.467742          62                    0                      0   
2    0.499661         194                   16                      1   
3    0.738342        1942                11430                   1170   
4    0.666607         203                 1270                    152   

   human_MergedPR  bot_work  bot_Push  bot_IssueComments  bot_PRReviewComment  \
0            

In [28]:
# Remove the columns 'bot_PRReviewComment' and 'bot_MergedPR'
df_dropped = df.drop(columns=['bot_PRReviewComment', 'bot_MergedPR'])
print("\nDataFrame after removing 'bot_PRReviewComment' and 'bot_MergedPR':")
print(df_dropped.head())


DataFrame after removing 'bot_PRReviewComment' and 'bot_MergedPR':
                                          name_h  Team_type Team_size_class  \
0  _1bqaxzCk0sfQaunsjeViQ/RCEZ3CASdLXbstu9y2JQ7Q  human-bot           Small   
1  _9o07rGiC7DFyi-zm91Q0g/VOgMsrjYEwFAq0BY8kHqGQ      human           Small   
2  _DzK53uaZXnAX3WcC0W28g/Epc4QWw5PNBQIIdvopEHDA      human           Large   
3  _l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA  human-bot           Large   
4  _l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g  human-bot           Large   

   human_members_count  bot_members_count  human_work  work_per_human  \
0                    2                  1          66       33.000000   
1                    2                  0          62       31.000000   
2                    7                  0         211       30.142857   
3                  234                 12       14579       62.303419   
4                   38                  8        1625       42.763158   

   human_gini  hum

In [29]:
# Save the columns 'Team_size_class' and 'human_members_count' as a new DataFrame
team_size_members_df = df[['Team_size_class', 'human_members_count']].copy()

In [30]:
# Rename these two columns in the new DataFrame
team_size_members_df.rename(columns={'Team_size_class': 'Size Class', 'human_members_count': 'Human Members Count'}, inplace=True)
print("\nRenamed DataFrame:")
print(team_size_members_df.head())


Renamed DataFrame:
  Size Class  Human Members Count
0      Small                    2
1      Small                    2
2      Large                    7
3      Large                  234
4      Large                   38
