<img src="https://drive.google.com/uc?id=1E_GYlzeV8zomWYNBpQk0i00XcZjhoy3S" width="100"/>

# DSGT Bootcamp Week 1: Introduction and Environment Setup

# Learning Objectives

1. Gain an understanding of Intel Developer Cloud
2. Introduction to team project
3. Gain an understanding of Kaggle
4. Download and prepare dataset
5. Install dependencies
6. Gain an understanding of the basics of Python
7. Gain an understanding of the basics of GitHub / Git

<img src="https://www.kaggle.com/static/images/site-logo.png" alt="kaggle-logo-LOL"/>

# Introducing Kaggle



#### [Kaggle](https://kaggle.com) is an online 'practice tool' that helps you become a better data scientist. It offers data science challenges, tutorials, and resources to help you improve your skillset.


#### For this bootcamp, we'll be trying to predict trends using the Kaggle Titanic Data Set. This dataset records variables related to the passengers and victims of the Titanic sinking. By the end of this bootcamp, you'll submit your machine learning model to the leaderboards and see how well it performs compared to others worldwide.

#### For more information on Kaggle, check out the resources section.

# Accessing the Titanic Dataset

#### To speed up the data download process, we've placed the data in this workshop folder. If you look under "Files", you'll see a file called `titanic_train.csv`.


In [7]:
"""
You can use the following commands in notebook code cells (Colab or Jupyter).
Type "%pwd" to print the folder you are currently in and "%ls" to list its files and subfolders. Use "%cd [subfolder]"
to change your current directory to where the data is.
"""
%pwd

'/home/u0ffb3364e89b0a170bc2fe23407f010/In_Person/Workshop One'

In [8]:
%ls

'Workshop 1 Drill Spring 2024.ipynb'   hello/
 Workshop_1_Spring_2024.ipynb          titanic_train.csv


In [3]:
"""
You can use mkdir to make a new directory and rm -rf to remove a directory and everything inside it
"""
%mkdir hello #make a directory called hello
#%cd hello
#%cat > cat.txt #create a file called cat.txt in hello/
#%ls 
# go back up to the parent directory
#%cd ..
#%ls
#%rm -rf hello #remove the hello directory and all subdirectories and files within
#%ls

mkdir: cannot create directory ‘hello’: File exists


FYI, you can also try out the commands above in a terminal (without the leading `%`). Go to "Project" on the right sidebar. Under "Terminals", click the "+" icon to create a new terminal. The terminal is already configured for you to run your Linux commands.
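
The same folder operations can also be done in plain Python. The block below is a minimal sketch using the standard-library `pathlib`, `shutil`, and `tempfile` modules. It works inside a temporary directory (an assumption made here so that nothing in the workshop folder is touched), and the names `hello` and `cat.txt` simply mirror the commented-out commands above.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the workshop folder stays untouched.
workdir = Path(tempfile.mkdtemp())

print(Path.cwd())                        # like %pwd: the folder Python is running in

hello = workdir / "hello"
hello.mkdir(exist_ok=True)               # like mkdir, but no error if it already exists
(hello / "cat.txt").write_text("meow")   # create a small file inside hello/

# Like %ls: collect the names of everything inside each folder.
top_level = sorted(p.name for p in workdir.iterdir())
inside_hello = sorted(p.name for p in hello.iterdir())
print(top_level)       # ['hello']
print(inside_hello)    # ['cat.txt']

shutil.rmtree(hello)                     # like rm -rf hello
still_there = hello.exists()
print(still_there)     # False

shutil.rmtree(workdir)                   # clean up the throwaway directory
```

Passing `exist_ok=True` avoids the "File exists" error that `mkdir` reports when the directory is already there.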

# Read Data with Pandas

#### `!pip install` adds libraries (things that add more functionality to Python) to this environment as opposed to your machine. 

#### We'll worry about importing and using these libraries later. For now, let's just make sure your environment has them installed.

#### Applied Data Science frequently uses core libraries to avoid "reinventing the wheel". One of these is pandas!

In [9]:
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable


In [5]:
import pandas as pd #pd is the standard abbreviation for the library

#### Now that we're in the correct folder, we can use pandas to take a sneak peek at the data. Don't worry about these commands -- we'll cover them next week!

In [10]:
df = pd.read_csv("titanic_train.csv")
df.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


One handy feature of this notebook environment is that you can filter and sort the dataframe's columns directly in the output, which helps when exploring your data before writing any code. There is also a widget you can use to create visualizations for columns within your dataset. We'll dive deeper into pandas in Workshop 2.
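
As a small preview of `pd.read_csv`, the sketch below reads three of the rows shown above from an in-memory string instead of a file. It also illustrates one common source of a column named `Unnamed: 0`: a CSV whose first header cell is empty, typically because a row index was saved with the data. Whether that is what happened to `titanic_train.csv` is an assumption; the names `csv_text`, `raw`, and `sample` are made up for this example.

```python
import io
import pandas as pd

# Three rows copied from the output above; the first header cell is left empty.
csv_text = ''',PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
'''

# Read as-is: pandas names the header-less first column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns[0])           # Unnamed: 0

# index_col=0 uses that first column as the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.shape)             # (3, 12)

# Empty fields are read as missing values (NaN).
missing_cabins = int(sample["Cabin"].isna().sum())
print(missing_cabins)           # 2
```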

# Introduction to the Python Programming Language

### **Why do we use Python?**
- Easy to read and understand
- Lots of libraries for Data Science
- One of the most popular languages for Data Science (alongside R)


# Primer on Variables, If Statements, and Loops

In [11]:
#You can create a variable by using an "=" sign. The value on the right gets 
#assigned to the variable name on the left.

a = 10
b = 20
print(a + b)

c = "Data Science "
d = "is fun"
print(c + d)

30
Data Science is fun
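
Every variable holds a value of some type, and `+` only works between compatible types. The following is a minimal sketch of checking types and combining a number with text; the variable names `label` and `label_f` are made up for this example.

```python
a = 10
c = "Data Science "

# type() reports what kind of value a variable currently holds.
print(type(a).__name__)   # int
print(type(c).__name__)   # str

# Adding a string and a number directly raises a TypeError,
# so convert the number to text first.
label = c + str(a)
print(label)              # Data Science 10

# An f-string builds the same text and is often easier to read.
label_f = f"{c}{a}"
print(label_f == label)   # True
```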


In [8]:
#If statements allow you to run certain lines of code based on certain conditions.

if (c + d) == "Data Science is fun!":
  print("Correct!")
else: # this section is only triggered if (c + d) doesn't equal "Data Science is fun!"
  print("False!")

False!
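
The check above prints `False!` because string comparison is exact: `"Data Science is fun"` has no trailing `!`. As a sketch of how more than two outcomes can be handled, the block below adds an `elif` branch; the names `sentence` and `result` are made up for this example.

```python
c = "Data Science "
d = "is fun"
sentence = c + d

if sentence == "Data Science is fun!":
    result = "exact match"
elif sentence.startswith("Data Science"):
    # Runs only when the first condition was False and this one is True.
    result = "starts with 'Data Science'"
else:
    result = "no match"

print(result)                                    # starts with 'Data Science'

# Adding the missing "!" makes the original comparison succeed.
print(sentence + "!" == "Data Science is fun!")  # True
```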


In [12]:
#For loops are used to perform an action a fixed number of times, or to go through each element in a list or string
for index in range(0, a):
  print('DSGT')

DSGT
DSGT
DSGT
DSGT
DSGT
DSGT
DSGT
DSGT
DSGT
DSGT
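
The loop above relies on `range(0, a)`, which produces the numbers 0 through 9. A short sketch of the other forms `range` accepts, with made-up variable names:

```python
# range(stop) counts from 0 up to, but not including, stop.
first_five = list(range(5))
print(first_five)        # [0, 1, 2, 3, 4]

# range(start, stop, step) lets you choose the start and the step size.
evens = list(range(0, 10, 2))
print(evens)             # [0, 2, 4, 6, 8]

# Accumulate a running total inside a loop.
total = 0
for number in range(1, 5):   # 1, 2, 3, 4
    total += number
print(total)             # 10
```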


In [10]:
#In this block of code, c+d is treated as a list of letters, with letter serving 
#as each individual character as the for loop iterates through the string.
for letter in c + d:
  print(letter)

D
a
t
a
 
S
c
i
e
n
c
e
 
i
s
 
f
u
n
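
Building on the string loop above, the sketch below shows two related patterns: `enumerate()` to get each position alongside its character, and a counter that skips spaces. The names `phrase`, `letter_count`, and `words` are made up for this example.

```python
phrase = "Data Science is fun"

# enumerate() yields (position, item) pairs, so no manual counter is needed.
for position, letter in enumerate("DSGT"):
    print(position, letter)

# Count the letters while skipping spaces.
letter_count = 0
for letter in phrase:
    if letter != " ":
        letter_count += 1
print(letter_count)      # 16

# split() breaks the string into a list of words.
words = phrase.split()
print(words)             # ['Data', 'Science', 'is', 'fun']
```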


# Lists, Tuples, and Dictionaries

In [11]:
# Let's start by creating a list (similar to an array in other languages)
c = ["a", 2, "c"] # positions: "a" is at 0, 2 is at 1, "c" is at 2

# We can read or replace an element by its position in the list. 
# Position counting starts at 0 in Python.
c[0] = 5 # replace the first element
print(c[0])

5
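
A few more everyday list operations, as a minimal sketch (the names `items` and `first_two` are made up for this example):

```python
items = ["a", 2, "c"]

print(len(items))      # 3
print(items[-1])       # c  (negative positions count from the end)

items.append("d")      # add one element to the end
print(items)           # ['a', 2, 'c', 'd']

first_two = items[:2]  # slicing copies positions 0 and 1
print(first_two)       # ['a', 2]

print(2 in items)      # True
```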


In [13]:
# Tuples are like lists, but they are immutable (they cannot be changed after creation)
tup = ("car", True, 4)
tup[2] = 5 # raises a TypeError

TypeError: 'tuple' object does not support item assignment
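
The error above stops the cell. As a sketch of how to handle it and what to do instead, the block below catches the `TypeError` with `try`/`except` and builds a new tuple rather than modifying the old one. The names `message`, `new_tup`, `vehicle`, `is_new`, and `wheels` are made up for this example.

```python
tup = ("car", True, 4)

# Reading from a tuple works just like reading from a list.
print(tup[0])          # car

# Writing raises a TypeError; try/except lets the program keep running.
try:
    tup[2] = 5
except TypeError as error:
    message = str(error)
    print("Caught:", message)

# To "change" a tuple, build a new one from pieces of the old one.
new_tup = tup[:2] + (5,)
print(new_tup)         # ('car', True, 5)

# Unpacking assigns each element to its own variable.
vehicle, is_new, wheels = tup
print(wheels)          # 4
```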

In [14]:
# Dictionaries store key-value pairs (insertion order is preserved since Python 3.7)
d = {"Data Science": "Fun", "GPA": 4, "Best Numbers": [3, 4]}

# We can get values by looking up their corresponding key
print(d["Data Science"])

# We can also reassign the value of keys
d["Best Numbers"] = [99, 100]

# And add keys
d["Birds are Real"] = False


#We can also print out all the key value pairs

print(d)

Fun
{'Data Science': 'Fun', 'GPA': 4, 'Best Numbers': [99, 100], 'Birds are Real': False}
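
Looking up a key that does not exist raises a `KeyError`. The sketch below shows a few safer lookup patterns and how to loop over a dictionary; the key `"Major"` and the names `major` and `keys_seen` are made up for this example.

```python
d = {"Data Science": "Fun", "GPA": 4, "Best Numbers": [3, 4]}

# d["Major"] would raise a KeyError; get() returns a fallback instead.
major = d.get("Major", "undeclared")
print(major)                  # undeclared

# "in" checks whether a key exists.
print("GPA" in d)             # True

# items() yields (key, value) pairs in insertion order.
keys_seen = []
for key, value in d.items():
    keys_seen.append(key)
    print(key, "->", value)

print(keys_seen)              # ['Data Science', 'GPA', 'Best Numbers']
```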


## Functions

In [15]:
# Functions help improve code reusability and readability. 
# You can define a series of steps and then use them multiple times.

def add(a, b):
  total = a + b # named "total" to avoid hiding Python's built-in sum()
  return total

print(add(2, 4))
print(add(4, 7))
print(add(3 * 4, 6))

6
11
18
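
Functions can also have default values for their parameters and can call each other. A minimal sketch, where the default `b=1` and the helper `average` are made up for this example:

```python
def add(a, b=1):
    """Return a + b; b defaults to 1 when it is not supplied."""
    return a + b

print(add(2, 4))       # 6
print(add(2))          # 3  (b falls back to its default)
print(add(b=10, a=5))  # 15 (keyword arguments can be given in any order)

# Functions can call other functions.
def average(x, y):
    return add(x, y) / 2

print(average(4, 7))   # 5.5
```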


**Note**: A lot of Data Science is dependent on having a solid foundation in Python. If you aren't currently familiar, we *highly recommend* spending some time learning (tutorials available in resources). Otherwise, using the libraries and parsing data in a foreign language may make things rather difficult.

## **Introduction to Version Control and GitHub**
What is Version Control?
*  Tracks changes in computer files
*  Coordinates work between multiple developers
*  Allows you to revert to an earlier version at any time
*  Can have local & remote repositories


What is GitHub?
*  A cloud-based Git repository hosting service
*  Allows users to use Git for version control
*  **Git** is a command line tool
*  **GitHub** is a web-based graphical user interface

# Set Up

If you do not already have Git on your computer, use the following link to install it:

[Install Git](https://git-scm.com/downloads)

**Setting Up a Repo**


*  `git config`

  *   `git config --global user.name "YOUR_NAME"`

  *   `git config --global user.email "YOUR_EMAIL"`


**Create a Repo**

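*   `git init` -- create a new repository in the current folder

*   `git clone [URL]` -- copy an existing remote repository to your machine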

**Note:** You can use https://github.gatech.edu, with:

*   `YOUR_NAME` = the username you log into GitHub with (your GT username for https://github.gatech.edu)
*   `YOUR_EMAIL` = the email you log into GitHub with (your GT email for https://github.gatech.edu)

**GitHub GUI**

One person in each team will create a "New Repository" on GitHub. Once they add team members to the repo, anyone on the team can clone the project to their local device using `git clone [URL]`.

# Steps for Using Git


1.   Check that you are up to date with Remote Repo -- `git fetch`
  *   check status -- `git status` 
  *   if not up to date, pull down changes -- `git pull`

2.   Make changes to code
3.   Add all changes to the staging area -- `git add .`
4.   Commit the staged changes -- `git commit -m "[message]"`
5.   Update the Remote Repo with your changes -- `git push`


**Summary**

3-step process for making commits (after you have made a change):


1.   ADD
2.   COMMIT
3.   PUSH


# Branching

By default, a new project starts on a single default branch (`main` on GitHub; `master` in older repositories and some Git setups). However, it is good practice to use separate branches for different features, people, etc.

* To see all local branches --  `git branch` 

* To create a branch -- `git branch [BRANCHNAME]`

* To move to a branch -- `git checkout [BRANCHNAME]`

* To create a new branch **and** move to it -- `git checkout -b [BRANCHNAME]`

# Merging
Merging allows you to carry the changes in one branch over to another branch. GitHub is your best friend for this: you can view and resolve merge conflicts through the GUI very easily. However, it is also good to know how to merge manually, in case you are unable to resolve the conflicts through the GUI.

**Manual Steps**
1.   `git checkout [NAME_OF_BRANCH_TO_MERGE_INTO]`

2.   `git merge [NAME_OF_BRANCH_TO_BRING_IN]`



**Note:** GitHub provides a benefit for university students called the GitHub Student Developer Pack. As long as you have your university ID (e.g., BuzzCard), you can apply for it with your personal GitHub account by following this [link](https://docs.github.com/en/education/explore-the-benefits-of-teaching-and-learning-with-github-education/github-global-campus-for-students/apply-to-github-global-campus-as-a-student).

# Helpful Resources

#### [Colab Overview](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)
#### [Kaggle Courses](https://www.kaggle.com/learn/overview)
#### [Kaggle](https://www.kaggle.com/)
#### [Intro Python](https://pythonprogramming.net/introduction-learn-python-3-tutorials/)
#### [Pandas Documentation](https://pandas.pydata.org/docs/)
#### [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
#### [GitHub Tutorial](https://guides.github.com/activities/hello-world/)
#### [GitKraken](https://www.gitkraken.com/) - We recommend getting the GitHub Student Developer Pack with your personal GitHub account to access some cool tooling in GitKraken