# Data Wrangling
---
Data wrangling is the process of transforming raw data into a cleaner, more usable form.

Examples:
1. Preparing data for input to a machine learning algorithm.
2. Analyzing algorithm output: transforming the data into formats that are easier to understand, for example for plotting.

#### Mount the Google Drive folder from the left panel; this is where our data set files are found.
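As an alternative to the left panel, the mount can also be done from a code cell. The sketch below assumes the notebook runs in Google Colab and that Drive should be mounted at `/content/drive` (the prefix used by the file paths in this notebook). The `in_colab` flag is a hypothetical helper name, added only so the cell does not fail outside Colab.

```python
# Mount Google Drive when running inside Colab; elsewhere, skip the mount.
try:
    from google.colab import drive  # only importable inside Google Colab
    drive.mount('/content/drive')   # asks for authorization on first use
    in_colab = True                 # hypothetical flag, for illustration only
except ImportError:
    in_colab = False
    print("Not running in Colab; skipping the Drive mount.")
```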

In [None]:
# File paths for the repo and user data sets on Google Drive:

repo_path = '/content/drive/MyDrive/Datasets/repos-dump.csv'
users_path = '/content/drive/MyDrive/Datasets/user-geocodes-dump.csv'

#### Pandas and Data Set import:

In [None]:
# now we must import Pandas
import pandas as pd

In [None]:
users = pd.read_csv(users_path, quotechar = '"', skipinitialspace = True)
repos = pd.read_csv(repo_path, quotechar = '"', skipinitialspace = True)
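To see what `quotechar` and `skipinitialspace` do, here is a minimal sketch on an in-memory CSV. The sample text and its column names are made up for illustration and are not taken from the real data set files.

```python
import io
import pandas as pd

# Hypothetical sample: note the space after each comma and the comma
# inside the quoted description.
sample_csv = 'full_name, description\n"octocat/hello-world", "A demo, with a comma"\n'

# skipinitialspace=True drops the space after each delimiter, so the quote
# that follows is recognized as the start of a quoted field.
demo = pd.read_csv(io.StringIO(sample_csv), quotechar='"', skipinitialspace=True)

print(demo.columns.tolist())
print(demo.loc[0, 'description'])
```

With these options the header is read without a leading space, and the quoted description stays in one field even though it contains a comma.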

#### Repo Preview:

In [None]:
repos.head()

#### Users Preview:

In [None]:
users.head()

#### Drop duplicate entries, split "full_name" into 2 columns, and rename the "id" column to "user"

In [None]:
# Drop duplicate entries from the repos DataFrame, keeping the last occurrence of each full_name:

print(f"Shape before dropping duplicates: {repos.shape}\n")
repos = repos.drop_duplicates(subset='full_name', keep='last')
print(f"Shape after dropping duplicates: {repos.shape}\n")
repos.head()

# The shapes before and after are the same, so there were no duplicate full_name values.
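Since the real data happened to contain no duplicates, the effect of `keep='last'` is easier to see on a toy example. The sketch below uses a made-up three-row DataFrame; `duplicated()` is shown as a way to count duplicates directly instead of comparing shapes.

```python
import pandas as pd

# Hypothetical toy data: 'alice/app' appears twice with different star counts.
toy = pd.DataFrame({
    'full_name': ['alice/app', 'bob/lib', 'alice/app'],
    'stars': [10, 5, 12],
})

# Count rows that repeat an earlier full_name.
n_dupes = toy.duplicated(subset='full_name').sum()

# keep='last' retains the final occurrence of each full_name.
deduped = toy.drop_duplicates(subset='full_name', keep='last')

print(n_dupes)
print(deduped)
```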

In [None]:
# in the repos list, the full_name format is currently: username/repo_name
# we need to divide the full_name into 2 separate columns
# 1st column is the user name, 2nd column is the repo name

def extract_user(line):
  return line.split('/')[0]

# Alternate:
#def extract_user(line):
#  line = line.split('/')
#  return line[0]

def extract_repo(line):
  return line.split('/')[1]

# Apply the helper functions to full_name to create 2 new columns:
repos['user'] = repos['full_name'].apply(extract_user)
repos['repo'] = repos['full_name'].apply(extract_repo)

print(f"Shape After Processing: {repos.shape}\n")
repos.head()

# Rows are the same as before, but 2 additional columns have been added
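The same split can also be written without helper functions, using the vectorized string methods in pandas. This is a sketch on made-up values, assuming every `full_name` contains exactly one `/` separating the user from the repo.

```python
import pandas as pd

# Hypothetical toy data in the username/repo_name format.
toy = pd.DataFrame({'full_name': ['alice/app', 'bob/lib']})

# n=1 splits on the first '/' only; expand=True returns one column per part.
toy[['user', 'repo']] = toy['full_name'].str.split('/', n=1, expand=True)

print(toy)
```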

In [None]:
# Drop duplicate entries from the users DataFrame, keeping the last occurrence of each id:

print(f"Shape before dropping duplicates: {users.shape}\n")
users = users.drop_duplicates(subset='id', keep='last')
print(f"Shape after dropping duplicates: {users.shape}\n")
users.head()

# The shapes before and after are the same, so there were no duplicate id values.

In [None]:
# rename the "id" column to "user"

users.rename(columns = {'id': 'user'}, inplace = True)  # renames column
users.head()  # preview the file

#### Merging The Datasets

In [None]:
# Use the pandas merge function to combine the 2 datasets.

# Merge "repos" and "users" on the shared 'user' column. how='left' keeps
# every row of the left DataFrame (repos) and adds the matching user columns.
# There are 8697 repos and 6246 users.
repos_users = pd.merge(repos, users, on = 'user', how = 'left')
print(repos_users.shape)
repos_users.head()

# Notice that the number of columns is now 14 because the datasets have been merged.
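A left merge keeps every repo row even when no matching user exists, and fills the user columns with NaN in that case. The sketch below illustrates this on made-up data. The `indicator=True` and `validate='many_to_one'` arguments are optional additions, not part of the merge above: the first adds a `_merge` column showing where each row came from, the second raises an error if a user appears more than once on the right side.

```python
import pandas as pd

# Hypothetical toy data: 'carol' has a repo but no entry in the users table.
toy_repos = pd.DataFrame({'user': ['alice', 'bob', 'carol'],
                          'repo': ['app', 'lib', 'cli']})
toy_users = pd.DataFrame({'user': ['alice', 'bob'],
                          'country': ['US', 'DE']})

merged = pd.merge(toy_repos, toy_users, on='user', how='left',
                  indicator=True, validate='many_to_one')

# Rows found only in the left DataFrame had no matching user.
unmatched = merged[merged['_merge'] == 'left_only']

print(merged)
print(unmatched['user'].tolist())
```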

#### Rearranging the columns of the merged DataFrame: Reindexing

In [None]:
# Columns not named in this list are dropped from the result.
column_order = ['user', 'repo', 'description', 'stars', 'fork', 'language',
                'full_name', 'type', 'location', 'lat', 'long', 'city', 'country']
repos_users = repos_users.reindex(column_order, axis = 1)
print(repos_users.shape)  # 13 columns
repos_users.head()
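Two properties of `reindex` are worth knowing when it is used to reorder columns: any column left out of the list is dropped, and any name that does not exist is created as a column of NaN rather than raising an error. A minimal sketch on made-up columns:

```python
import pandas as pd

# Hypothetical toy frame with columns a, b, c.
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# 'b' is omitted (so it is dropped) and 'missing' does not exist
# (so it is created and filled with NaN).
reordered = toy.reindex(['c', 'a', 'missing'], axis=1)

print(reordered)
```

If a misspelled column name should fail loudly instead, selecting with `toy[['c', 'a']]` is a stricter alternative, since it raises a `KeyError` for unknown names.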

#### Ranking based on the "stars" column

In [None]:
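# To drop a column by name ('column_name' is a placeholder):
# repos_users.drop(columns = ['column_name'])
# Note: repos_users.dropna(axis = 1) instead drops every column that contains a missing value.

# Add a rank column at the end; the existing columns are left unchanged.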
repos_users['rank'] = repos_users['stars'].rank(ascending = False)
print(repos_users.shape)  # 14 columns
repos_users.head()
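By default `rank` uses `method='average'`, so repos with the same number of stars share the average of the positions they occupy and the ranks can be fractional. The sketch below compares this with the `'min'` and `'dense'` methods on made-up star counts.

```python
import pandas as pd

# Hypothetical star counts with one tie at the top.
stars = pd.Series([50, 20, 50, 10])

avg_rank = stars.rank(ascending=False)                    # default: method='average'
min_rank = stars.rank(ascending=False, method='min')      # ties share the lowest rank
dense_rank = stars.rank(ascending=False, method='dense')  # no gaps after ties

print(avg_rank.tolist())    # [1.5, 3.0, 1.5, 4.0]
print(min_rank.tolist())    # [1.0, 3.0, 1.0, 4.0]
print(dense_rank.tolist())  # [1.0, 2.0, 1.0, 3.0]
```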

In [None]:
# How many rows (repos) in the merged 'repos_users' DataFrame use Python?
print(repos_users[repos_users['language'] == 'Python'].shape)

# Display only the Python repos:
repos_users[repos_users['language'] == 'Python'].head()
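The filter above counts rows, and each row is a repo, so a user with several Python repos is counted several times. The sketch below, on made-up data, shows one way to count distinct users instead, and how `value_counts` gives the row count for every language at once.

```python
import pandas as pd

# Hypothetical toy data: 'alice' owns two Python repos.
toy = pd.DataFrame({
    'user': ['alice', 'alice', 'bob', 'carol'],
    'repo': ['app', 'cli', 'lib', 'site'],
    'language': ['Python', 'Python', 'Python', 'Go'],
})

python_rows = toy[toy['language'] == 'Python']

n_python_repos = len(python_rows)               # rows, i.e. repos
n_python_users = python_rows['user'].nunique()  # distinct users

# Row counts for every language in one call.
language_counts = toy['language'].value_counts()

print(n_python_repos, n_python_users)
print(language_counts)
```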