# Data Wrangling
---
Data wrangling is the process of transforming raw data into a cleaner, more usable form.

Examples:
1. Preparing data for input to a machine learning algorithm.
2. Analyzing algorithm output: transforming the data into formats that are easier to understand, for example for plotting.

#### Establish a file path reference for the repo and user files:

In [1]:
repo_path = './04_repos-dump.csv'
users_path = './04_user-geocodes-dump.csv'

#### Pandas and Data Set import:

In [2]:
import pandas as pd

# Import Data
users = pd.read_csv(users_path, quotechar = '"', skipinitialspace = True)
repos = pd.read_csv(repo_path, quotechar = '"', skipinitialspace = True)

#### Repo Preview:

In [3]:
display(repos.head())

Unnamed: 0,full_name,stars,forks,description,language
0,thedaviddias/Front-End-Checklist,24267,2058,🗂 The perfect Front-End Checklist for modern w...,JavaScript
1,GoogleChrome/puppeteer,21976,1259,Headless Chrome Node API,JavaScript
2,parcel-bundler/parcel,13981,463,"📦🚀 Blazing fast, zero configuration web applic...",JavaScript
3,Chalarangelo/30-seconds-of-code,13466,1185,Curated collection of useful Javascript snippe...,JavaScript
4,wearehive/project-guidelines,11279,970,A set of best practices for JavaScript projects,JavaScript


#### Users Preview:

In [4]:
display(users.head())

Unnamed: 0,id,name,type,location,lat,long,city,country
0,shprink,Julien Renaux,User,"Toulouse, France",43.604652,1.444209,Toulouse,France
1,lllyasviel,,User,,,,,
2,arzzen,Lukáš Mešťan,User,"Zilina, Slovakia",49.21945,18.7408,Žilina,Slovakia
3,javierbyte,Javier Bórquez,User,"Guadalajara, MX",20.659699,-103.349609,Guadalajara,Mexico
4,nbarbettini,Nate Barbettini,User,"San Francisco, CA",37.774929,-122.419415,San Francisco,United States


#### Drop duplicate entries, split "full_name" into 2 columns, and rename "id" column to "user"
 - Drop duplicates from the repos list
 - Since the shape before and shape after are the same, there were no duplicates.

In [5]:
print(f"Shape before dropping duplicates: {repos.shape}\n")
repos = repos.drop_duplicates(subset='full_name', keep='last')

print(f"Shape after dropping duplicates: {repos.shape}\n")
display(repos.head())

Shape before dropping duplicates: (8697, 5)

Shape after dropping duplicates: (8697, 5)



Unnamed: 0,full_name,stars,forks,description,language
0,thedaviddias/Front-End-Checklist,24267,2058,🗂 The perfect Front-End Checklist for modern w...,JavaScript
1,GoogleChrome/puppeteer,21976,1259,Headless Chrome Node API,JavaScript
2,parcel-bundler/parcel,13981,463,"📦🚀 Blazing fast, zero configuration web applic...",JavaScript
3,Chalarangelo/30-seconds-of-code,13466,1185,Curated collection of useful Javascript snippe...,JavaScript
4,wearehive/project-guidelines,11279,970,A set of best practices for JavaScript projects,JavaScript
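The shape comparison above can be cross-checked by counting duplicate keys directly. The following is a minimal sketch on a small invented frame (`toy` and its rows are hypothetical, not taken from the repos file); it shows `duplicated` reporting how many rows repeat an earlier `full_name`, and what `keep='last'` retains.

```python
import pandas as pd

# Toy stand-in for `repos`; these rows are invented for illustration only.
toy = pd.DataFrame({
    "full_name": ["a/x", "b/y", "a/x"],
    "stars": [10, 5, 12],
})

# duplicated() flags every repeat of a key after its first occurrence.
n_dupes = toy.duplicated(subset="full_name").sum()
print(f"Duplicate full_name rows: {n_dupes}")

# keep='last' retains the final occurrence of each key, as in the cell above.
deduped = toy.drop_duplicates(subset="full_name", keep="last")
print(deduped)
```

Applied to the real `repos` frame, a count of 0 would agree with the unchanged shape.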


#### Divide the full_name into 2 separate columns:
 - In `repos`, the `full_name` format is currently `username/repo_name`.
 - Desired output: the 1st new column is the user name, the 2nd is the repo name.

In [6]:
def extract_user(line):
  return line.split('/')[0]

# Alternate:
#def extract_user(line):
#  line = line.split('/')
#  return line[0]

def extract_repo(line):
  return line.split('/')[1]

# Parse full_name and create 2 new columns.
# (The original `.str[:]` slice was a no-op on string values and is dropped here.)
repos['user'] = repos['full_name'].apply(extract_user)
repos['repo'] = repos['full_name'].apply(extract_repo)

print(f"Shape After Processing: {repos.shape}\n")
display(repos.head())

# Rows are the same as before, but 2 additional columns have been added

Shape After Processing: (8697, 7)



Unnamed: 0,full_name,stars,forks,description,language,user,repo
0,thedaviddias/Front-End-Checklist,24267,2058,🗂 The perfect Front-End Checklist for modern w...,JavaScript,thedaviddias,Front-End-Checklist
1,GoogleChrome/puppeteer,21976,1259,Headless Chrome Node API,JavaScript,GoogleChrome,puppeteer
2,parcel-bundler/parcel,13981,463,"📦🚀 Blazing fast, zero configuration web applic...",JavaScript,parcel-bundler,parcel
3,Chalarangelo/30-seconds-of-code,13466,1185,Curated collection of useful Javascript snippe...,JavaScript,Chalarangelo,30-seconds-of-code
4,wearehive/project-guidelines,11279,970,A set of best practices for JavaScript projects,JavaScript,wearehive,project-guidelines
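The same split can also be done without helper functions. This is a sketch of one alternative, assuming every `full_name` contains at least one `/`; the frame `toy` is a hypothetical stand-in that reuses two names from the preview above.

```python
import pandas as pd

# Hypothetical two-row stand-in for `repos`.
toy = pd.DataFrame({
    "full_name": ["GoogleChrome/puppeteer", "parcel-bundler/parcel"],
})

# n=1 splits on the first '/' only; expand=True returns one column per piece.
toy[["user", "repo"]] = toy["full_name"].str.split("/", n=1, expand=True)

print(toy)
```

Limiting the split with `n=1` keeps any further `/` characters inside the repo part instead of producing extra columns.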


#### Rename the `id` column to `user`

In [7]:
display(users.head())

users.rename(columns = {'id': 'user'}, inplace = True)  # renames column

display(users.head())

Unnamed: 0,id,name,type,location,lat,long,city,country
0,shprink,Julien Renaux,User,"Toulouse, France",43.604652,1.444209,Toulouse,France
1,lllyasviel,,User,,,,,
2,arzzen,Lukáš Mešťan,User,"Zilina, Slovakia",49.21945,18.7408,Žilina,Slovakia
3,javierbyte,Javier Bórquez,User,"Guadalajara, MX",20.659699,-103.349609,Guadalajara,Mexico
4,nbarbettini,Nate Barbettini,User,"San Francisco, CA",37.774929,-122.419415,San Francisco,United States


Unnamed: 0,user,name,type,location,lat,long,city,country
0,shprink,Julien Renaux,User,"Toulouse, France",43.604652,1.444209,Toulouse,France
1,lllyasviel,,User,,,,,
2,arzzen,Lukáš Mešťan,User,"Zilina, Slovakia",49.21945,18.7408,Žilina,Slovakia
3,javierbyte,Javier Bórquez,User,"Guadalajara, MX",20.659699,-103.349609,Guadalajara,Mexico
4,nbarbettini,Nate Barbettini,User,"San Francisco, CA",37.774929,-122.419415,San Francisco,United States


#### Merging The Datasets:
 - Use the pandas `merge` function to combine the 2 datasets.
 - Merge `repos` and `users` on the `user` column with a left join, so every row of `repos` (the left table) is kept even when no matching user exists.
 - There are 8697 repos and 6246 users.
 - The merged dataset has 14 columns: 7 from `repos` plus 8 from `users`, with the shared `user` column counted once.

In [8]:
repos_users = pd.merge(repos, users, on = 'user', how = 'left')

# 14 columns
print(f'Shape After Merge: {repos_users.shape}\n')
display(repos_users.head())

Shape After Merge: (8697, 14)



Unnamed: 0,full_name,stars,forks,description,language,user,repo,name,type,location,lat,long,city,country
0,thedaviddias/Front-End-Checklist,24267,2058,🗂 The perfect Front-End Checklist for modern w...,JavaScript,thedaviddias,Front-End-Checklist,David Dias,User,"France, Mauritius, Canada",,,,
1,GoogleChrome/puppeteer,21976,1259,Headless Chrome Node API,JavaScript,GoogleChrome,puppeteer,,Organization,,,,,
2,parcel-bundler/parcel,13981,463,"📦🚀 Blazing fast, zero configuration web applic...",JavaScript,parcel-bundler,parcel,Parcel,Organization,,,,,
3,Chalarangelo/30-seconds-of-code,13466,1185,Curated collection of useful Javascript snippe...,JavaScript,Chalarangelo,30-seconds-of-code,Angelos Chalaris,User,"Athens, Greece",37.98381,23.727539,Athens,Greece
4,wearehive/project-guidelines,11279,970,A set of best practices for JavaScript projects,JavaScript,wearehive,project-guidelines,Hive,Organization,London,51.507351,-0.127758,London,United Kingdom
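A left join keeps unmatched repos and fills their user columns with NaN, so it can help to see how many rows found no match. The sketch below uses two invented frames (`toy_repos` and `toy_users` are hypothetical) together with the `indicator` and `validate` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical miniature versions of `repos` and `users`.
toy_repos = pd.DataFrame({
    "user": ["alice", "alice", "bob"],
    "repo": ["r1", "r2", "r3"],
})
toy_users = pd.DataFrame({"user": ["alice"], "country": ["France"]})

merged = pd.merge(
    toy_repos,
    toy_users,
    on="user",
    how="left",              # keep every repo row
    validate="many_to_one",  # raise if a user appears twice in toy_users
    indicator=True,          # add a `_merge` column describing each row's origin
)

# Rows marked 'left_only' are repos with no matching user record.
unmatched = (merged["_merge"] == "left_only").sum()
print(merged)
print(f"Repos without a matching user: {unmatched}")
```

The `validate` check guards against a duplicated user silently multiplying repo rows, which would change the row count after the merge.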


#### Rearranging the columns of merged list: Reindexing

In [9]:
repos_users = repos_users.reindex(['user', 'repo', 'description', 'stars', 'forks', 'language', 'full_name', 'name', 'type', 'location', 'lat', 'long', 'city', 'country'], axis = 1)
# 14 columns

print(f'Shape after Reindexing: {repos_users.shape}\n')

display(repos_users.head())

Shape after Reindexing: (8697, 14)



Unnamed: 0,user,repo,description,stars,forks,language,full_name,name,type,location,lat,long,city,country
0,thedaviddias,Front-End-Checklist,🗂 The perfect Front-End Checklist for modern w...,24267,2058,JavaScript,thedaviddias/Front-End-Checklist,David Dias,User,"France, Mauritius, Canada",,,,
1,GoogleChrome,puppeteer,Headless Chrome Node API,21976,1259,JavaScript,GoogleChrome/puppeteer,,Organization,,,,,
2,parcel-bundler,parcel,"📦🚀 Blazing fast, zero configuration web applic...",13981,463,JavaScript,parcel-bundler/parcel,Parcel,Organization,,,,,
3,Chalarangelo,30-seconds-of-code,Curated collection of useful Javascript snippe...,13466,1185,JavaScript,Chalarangelo/30-seconds-of-code,Angelos Chalaris,User,"Athens, Greece",37.98381,23.727539,Athens,Greece
4,wearehive,project-guidelines,A set of best practices for JavaScript projects,11279,970,JavaScript,wearehive/project-guidelines,Hive,Organization,London,51.507351,-0.127758,London,United Kingdom
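One caveat with `reindex` is that a label missing from the frame does not raise an error; it produces a new all-NaN column. A minimal sketch on an invented frame (`toy` is hypothetical) contrasts this with list-based column selection, which raises `KeyError` instead.

```python
import pandas as pd

# Hypothetical frame with no 'forks' column.
toy = pd.DataFrame({"stars": [3, 1], "user": ["a", "b"]})

# reindex accepts the unknown label and fills it with NaN.
reordered = toy.reindex(["user", "stars", "forks"], axis=1)
print(reordered)

# Selecting with a list of labels fails loudly on the unknown label.
try:
    toy[["user", "stars", "forks"]]
    raised = False
except KeyError:
    raised = True
print(f"List selection raised KeyError: {raised}")
```

When the goal is only to reorder existing columns, the stricter selection form may catch a misspelled column name earlier.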


#### Ranking based on the "stars" column
 - Adds a new `rank` column at the end without modifying the existing columns.

In [10]:
repos_users['rank'] = repos_users['stars'].rank(ascending = False)

print(repos_users.shape)  # 15 columns

display(repos_users.head())

(8697, 15)


Unnamed: 0,user,repo,description,stars,forks,language,full_name,name,type,location,lat,long,city,country,rank
0,thedaviddias,Front-End-Checklist,🗂 The perfect Front-End Checklist for modern w...,24267,2058,JavaScript,thedaviddias/Front-End-Checklist,David Dias,User,"France, Mauritius, Canada",,,,,3.0
1,GoogleChrome,puppeteer,Headless Chrome Node API,21976,1259,JavaScript,GoogleChrome/puppeteer,,Organization,,,,,,4.0
2,parcel-bundler,parcel,"📦🚀 Blazing fast, zero configuration web applic...",13981,463,JavaScript,parcel-bundler/parcel,Parcel,Organization,,,,,,11.0
3,Chalarangelo,30-seconds-of-code,Curated collection of useful Javascript snippe...,13466,1185,JavaScript,Chalarangelo/30-seconds-of-code,Angelos Chalaris,User,"Athens, Greece",37.98381,23.727539,Athens,Greece,13.0
4,wearehive,project-guidelines,A set of best practices for JavaScript projects,11279,970,JavaScript,wearehive/project-guidelines,Hive,Organization,London,51.507351,-0.127758,London,United Kingdom,16.0
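The ranks above are floats because `rank` defaults to `method='average'`, which gives tied values the mean of the positions they occupy. The sketch below, on an invented series of star counts, compares that default with the `'min'` and `'dense'` methods.

```python
import pandas as pd

# Invented star counts with one tie.
stars = pd.Series([50, 30, 30, 10])

avg = stars.rank(ascending=False)                    # default method='average'
low = stars.rank(ascending=False, method="min")      # ties share the lowest rank
dense = stars.rank(ascending=False, method="dense")  # no gaps after ties

print(pd.DataFrame({"stars": stars, "average": avg, "min": low, "dense": dense}))
```

Which method fits best depends on how ties should be reported; the notebook's cell uses the default.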


#### How many Python repos are in the new `repos_users` dataframe?
 - Display the shape of just the rows whose language is Python.
 - Display the dataframe with only those Python rows.

In [11]:
print(repos_users[repos_users['language'] == 'Python'].shape)

# display only the rows whose language is Python:
repos_users[repos_users['language'] == 'Python'].head()

(1357, 15)


Unnamed: 0,user,repo,description,stars,forks,language,full_name,name,type,location,lat,long,city,country,rank
3308,donnemartin,system-design-primer,Learn how to design large-scale systems. Prep ...,21780,2633,Python,donnemartin/system-design-primer,Donne Martin,User,"Washington, D.C.",38.907192,-77.036871,Washington,United States,5.0
3309,python,cpython,The Python programming language,15060,3779,Python,python/cpython,Python,Organization,,,,,,9.0
3310,ageitgey,face_recognition,The world's simplest facial recognition api fo...,8487,1691,Python,ageitgey/face_recognition,Adam Geitgey,User,Various places,,,,,31.0
3311,tonybeltramelli,pix2code,pix2code: Generating Code from a Graphical Use...,8037,605,Python,tonybeltramelli/pix2code,Tony Beltramelli,User,Denmark,56.26392,9.501785,,Denmark,34.0
3312,google,python-fire,Python Fire is a library for automatically gen...,7663,386,Python,google/python-fire,Google,Organization,,,,,,36.0
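The filter above counts rows, and each row of `repos_users` is one repository, so a user with several Python repos is counted once per repo. A minimal sketch on an invented frame (`toy` is hypothetical) separates the two counts with `nunique`.

```python
import pandas as pd

# Hypothetical stand-in for `repos_users`.
toy = pd.DataFrame({
    "user": ["alice", "alice", "bob", "carol"],
    "language": ["Python", "Python", "Python", "JavaScript"],
})

python_rows = toy[toy["language"] == "Python"]

n_repos = len(python_rows)               # one row per repository
n_users = python_rows["user"].nunique()  # distinct owners of those repositories

print(f"Python repos: {n_repos}, distinct users: {n_users}")
```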
