# Tools and Setup

In data science we generally work with a common set of tools and software packages. We'll work to get a few of the basic ones set up now. Everything we touch on here will come up repeatedly as we go through the course. 

## Anaconda

Anaconda, the program that you downloaded and installed to get to this point, is a large bundle of useful data science programs and packages. It installs them all at once, so we don't need to hunt around downloading and installing each one separately. Installing Anaconda gives us all a (near) identical starting point. 

Every time we use some function that someone else has developed, which is constantly, we need to have that package installed first. Anaconda is a jump start on adding all of that stuff. We'll look at how to install more stuff later on. In the example below we are importing a package called "platform", and using that to tell us which version of Python we're using. 

<b>Note:</b> You may have more than one version of Python installed (other programs may have installed it). This is fine, but we do need to be a little attentive if that is the case: when we install things, we want them to be installed into the "correct" one. 


In [1]:
# Print Python Version

import platform
print(platform.python_version())

3.9.7
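
Since more than one Python may be installed, it can help to check which one the notebook is actually running. The following is a minimal sketch using only Python's standard library; the path it prints will be different on every machine.

```python
import sys

# Full path of the Python interpreter running this code.
# With a default Anaconda install this usually points inside the Anaconda folder.
print(sys.executable)

# The version as numbers, which is handy for comparisons.
major, minor = sys.version_info[:2]
print(f"Running Python {major}.{minor}")
```

If the path printed is not inside your Anaconda folder, the notebook is probably using a different Python than the one Anaconda installed.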


## Python, VS Code, Jupyter Notebooks

The first 3 of the tools that we will be using are:
<ul>
<li> Python - Python is the programming language we'll use to do our work. Python is the most common language used in data science and is one of the more common programming languages in the world. For us there are a few key benefits:
    <ul> 
    <li> Python is relatively easy to learn and use. Compared to other languages, it is very approachable. 
    <li> Python is commonly used in data science applications. So we can import things that others have written and find examples and documentation online. 
    <li> Python is very commonly used in industry, so the skills we learn here are very transferable to real work. 
    </ul>
<li> VS Code - VS Code is a tool called an IDE - an integrated development environment - basically a text editor used to write programs. We will use VS Code to create, edit, and run what we make. 
<li> Jupyter Notebooks - this one is behind the scenes. This file, and the others that we'll create and use, are called notebook files. These files are special because we can write code, run code, and add webpage-like text and images all on one page. This makes it easy for us to do everything in one place. In practice, it is common to develop things in a notebook like we'll be doing, then export the final product to be used in a production environment. 
</ul>

These 3 tools are the building blocks of everything that we'll do. 

### Notebooks

As mentioned above, notebook files like this one are the main type of file we'll use. In a notebook we can write code, run it, see the results, and embed that all in a web page style document. 

Notebooks have a few key features that we should explore right up front:
<ul>
<li> Cells - the content of a notebook is all in cells; there are two main types of cells - Markdown (markup) and code. 
    <ul>
    <li> Markdown - Markdown cells are like this one, fancy text boxes. Each is basically a mini-webpage. We can put text, instructions, or explanations in Markdown cells. 
    <li> Code - code cells are where the actual program code goes. Code cells can be run, and will output their results below the cell. See a code cell below this one. 
    </ul>
<li> Output - the output of whatever we are doing will display directly below the cell that we run. 
</ul>

Our end result will be something like a web page that can be filled with explanations and information, along with pieces of code that generate the results. The one stop shop of programming. 

#### Example - Simple Python Code

To the left of the cell below, a little play button should appear when you mouse over that cell. Click the play button to run (execute) the code in the cell and see the results printed below. 

In [2]:
# Do some simple stuff 
print("I am a code cell!!")
a = 3
b = 4
print(a, "times", b, "is", a*b)

I am a code cell!!
3 times 4 is 12


### VS Code

VS Code is a full-featured program for developing other programs. It has a lot of features and capabilities, of which we only need a few. 

The good thing about VS Code is that it incorporates almost everything we need into one centralized program. We can write code, run it, manage files, and sync to a repository all from this one window. 

VS Code consolidates several pieces of functionality into one tool:
<ul>
<li> At its core, it is an IDE (Integrated Development Environment), or basically a text editor for computer programming. 
<li> VS Code also includes an "environment" in which we can execute our programs (these notebook pages). 
<li> It also integrates with GitHub (repository for computer code), for saving and sharing code. 
</ul>

This means that we can write, run, save, and share our work all from one tool, which helps us keep things relatively simple. One key thing to note is that nothing we do "belongs" to VS Code - we can write and execute this code anywhere, we've just chosen this as our tool. Just like you can create a document in Word, Google Documents, or whatever Apple gives you with a Mac. The contents of your document can be created in any of the tools, shared between them, and "executed" (read by someone) in the same way no matter how it was made. 

#### Example - Load Some Data

Loading a dataset is a very common first-ish step for our work. Here we'll load a CSV file, which is most common, but we can also load different types of files or connect to data in a database or over the internet. 

We'll also load a package called Pandas to help us (computer scientists are not known for excellent naming practices). The head(x) command gives us a preview of the first 'x' rows of our data; when the x is missing, the default is 5. This type of optional/default setup for arguments is common. 

In [3]:
import pandas as pd

df = pd.read_csv("../data/train.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
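
To see the optional/default argument behaviour of head() in isolation, here is a minimal sketch. The CSV text is made up for illustration and is read from memory (via io.StringIO) rather than from a file, so it runs without the course data.

```python
import io
import pandas as pd

# A tiny, made-up CSV held in memory so the example runs without any file on disk.
csv_text = """name,score
A,10
B,20
C,30
D,40
E,50
F,60
G,70
"""

demo = pd.read_csv(io.StringIO(csv_text))

print(len(demo.head()))    # no argument: the default of 5 rows is used
print(len(demo.head(3)))   # explicit argument: the first 3 rows
```

The first line prints 5 and the second prints 3, even though the data has 7 rows.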


#### VS Code Live Share and Pair Programming

One other cool thing that VS Code allows us to do is a live sharing session of the code window. This will allow me to post a link for a class session from my VS Code install on my computer. You can take that link, load it in VS Code, and you will get whatever I'm doing updated live in your VS Code window. It is kind of like a screen share, but just for the code file, so you can still configure and resize the VS Code window as you please, and only the code inside of it will update.

This requires a little bit of setup to use, but is easy:
<ul>
<li> Install Live Share and Live Share Extension Pack extensions by clicking on the Extensions icon (4 blocks with one "flying away") in the toolbar on the left, searching for the two extensions, and clicking install. 
<li> A new Live Share icon (a circle with an arrow over it) will appear on the left. To join a session take the link I post, click "Join", and enter the link. 
</ul>

You can also use this to work together with others. "Pair programming", or working side-by-side with someone to talk through issues when you are coding, is a very common practice, and it can really help when dealing with something complex or confusing. You can click the "Share" link and generate a link that you can share with team members or friends - feel free to do so if we have a spot in class where we break for you all to work on an exercise. As well, if you're working on a project with others or grinding through some practice problems, this can be very useful. 

There is a lot of documentation online. One explainer with a video is located here: https://code.visualstudio.com/learn/collaboration/live-share. The Live Share window also has a link to the full documentation. 

### Python and Environments

Our code will be written in a programming language called Python; all of the code cells on this page are examples of Python code.

In addition to being a programming language, Python also comes with environments. An environment is the "universe" inside of which each Python program runs. We can have several, and you may have some different ones on your computer that came with other programs you may have installed. 

There is a little icon in the top right of the notebook that indicates which environment (a.k.a. kernel) you are currently using. By default, Anaconda sets one up named "base" that has a bunch of useful default stuff in it. We generally want to use this one. Each of the libraries (things such as pandas) that we install will only "exist" in one environment - the one in which it was installed. As we go through the semester we'll add other libraries to do other stuff. If we try to run our code in an environment where a library doesn't exist, we'll get an error, since Python can't find it. 

This part is very important to people who are producing programs that are going to be distributed to others (since they can rely on all the stuff they need existing in that environment). It is mostly just a minor annoyance to us, since we don't need to set up multiple different environments. 
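
One way to check what "exists" in the current environment is to ask Python directly. This is a minimal sketch using only the standard library; the three library names are just the ones mentioned in this notebook, and the results depend on your own setup.

```python
import sys
from importlib.util import find_spec

# Folder of the environment this kernel is running from.
print("Environment:", sys.prefix)

# find_spec returns None when a top-level library cannot be found in this environment.
for name in ["pandas", "numpy", "seaborn"]:
    status = "available" if find_spec(name) is not None else "NOT installed here"
    print(f"{name}: {status}")
```

If a library you just installed shows up as not installed, the install most likely went into a different environment than the one the notebook is using.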

#### Python vs Notebooks

We will do all of our work in Python in these notebook files. This is very common for data science work as we can make our programs and see the results all in one page. In "real" production environments we would probably still use notebooks just like ours to do the development, then export part (either a trained model or portion of code) into another format, which would then be integrated into the actual working systems. We are working entirely on that preparation part, so the last deployment step doesn't really matter to us. 

If you've ever programmed before you may have seen "regular" Python programming in regular, non-notebook files, likely with a .py file extension. Those are created using the same Python language, but are intended to be run like a "normal" program (i.e. someone presses "Run" and the program goes) rather than in the interactive environment we are using here. The language is the same in those files and the code we write, outside of a handful of commands that are special to either environment. If we want an example of something and we find it in a "regular" Python file, that's still useful because it is almost certainly the same in our notebooks. 
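
For comparison, this is roughly what a small "regular" Python file might look like. It is only a sketch: the file name hello.py and the function names are made up for illustration.

```python
# Contents of a hypothetical file named hello.py

def area(width, height):
    """Return the area of a rectangle."""
    return width * height


def main():
    print("Area:", area(3, 4))


# This block runs when the file is executed directly (for example: python hello.py)
# and is skipped when the file is imported from another program.
if __name__ == "__main__":
    main()
```

The functions themselves could be pasted into a notebook cell unchanged; only the last block is specific to running the file as a standalone program.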

#### Example - Information on our Dataset

We can use some simple commands to get some information about our data. 
<ul>
<li> The describe() command gives us basic statistics (we'll cover these more soon). The include="all" part tells it to include things that aren't numbers. 
<li> The info() command gives us some info on the types of data we have. 
</ul>

In [4]:
df.describe(include="all")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
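
In the info() output above, a "Non-Null Count" below 891 (Age and Cabin, for example) means some rows are missing a value in that column. As a small sketch of counting those gaps directly, using made-up data rather than the real dataset:

```python
import pandas as pd

# Made-up data: None marks a missing value.
demo = pd.DataFrame({
    "age": [22.0, None, 26.0, 35.0],
    "cabin": ["C85", None, None, "C123"],
})

# isna() marks each missing value as True; sum() counts them per column.
missing = demo.isna().sum()
print(missing)
```

Here age has 1 missing value and cabin has 2. Running the same line on our df would count the missing values in each of its columns.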


### Saving and Committing

If we look at the toolbar on the left of VS Code, we'll usually see some little blue balls with numbers in them as we work. These bubbles indicate our pending changes, and they are important for us when we are saving or uploading our work. 

If there is a bubble on the top (Explorer) icon of the toolbar, that is an indication that you have changes to your files that are not yet saved. The number is the number of files that have been modified. Use the Save or Save All command to save them. 

If there is a bubble on the third (Source Control) icon of the toolbar, that is an indication that you have saved changes that have not yet been committed and "pushed" (uploaded) to the central repository on GitHub. 

## Git and GitHub

Another of our foundational tools is Git, along with GitHub, the online service built around it. Git tracks changes to our files, and GitHub stores those files online. Together they help us manage and store code, share it between multiple contributors, update versions, and package final products. The combination is effectively a file manager for programs being developed. 

<b>Note:</b> At some point in setting up VS Code you'll need to install Git (without the hub) for this part to start working. In the window on the left, there will be a link to "Git" at some point; click that and follow the directions. This part may differ depending on what you've installed on your machine before and whether you're on Windows or Mac. There may be instructions to install some other program along the way; just follow those directions. 

Like the other tools, GitHub has a lot of capabilities, and we only need the basics. We'll use GitHub to:
<ul>
<li> Share content with you, like this repository. I can give you the GitHub link for you to clone the repository, and you then get a copy of the entire set of files. 
<li> Update changes. As we go through the semester I can update the files in this repository, and you can use the "Pull" command to automatically get all my changes. 
<li> Manage your work. You can save your work to your own GitHub repository and it will do things like track changes and versions for you. 
<li> Share with others. When there is more than one person working on a project, GitHub will manage changes made by different people to ensure that things stay in sync. 
<li> In more elaborate setups you can also use GitHub to do things like take in bug reports or run automated testing routines. 
</ul>

GitHub is a very, very commonly used tool in industry. It can be a little bit of a headache initially, but it is definitely worth the hassle of learning how to use it. If you work in any programming related job there is a very high likelihood you'll use Git, or an alternative that does the same thing. 

### Important Actions

There are a few fundamental things we need to know and be comfortable with to use GitHub and be successful:
<ul>
<li> Cloning a repository - you've all done this to get here, cloning a repository makes a copy of a code repository on GitHub and saves it to your computer. You can then work on it, and if you were in a real job you'd have other people working on the same repository at the same time. For assignments, you'll clone the repository I post that has your starting point in it, then start work on your copy. 
<li> Committing your changes - as you work on things like projects and assignments you can submit your progress into your own GitHub repository. Each time you submit changes GitHub will do a lot of work to maintain your code - versions will be archived, so you can roll back changes; if you are working with others, your changes will be merged; if your computer goes up in flames everything is saved online. 
</ul>

#### Using GitHub

There are lots of things that we can do in GitHub, but we'll mostly stick to relatively simple actions:
<ul>
<li> Cloning a repository - you've all done this to get here. Take the link, go to the Source Control window, click Clone Repository, enter the link, and choose where to save it on your computer. When you do assignments you'll do this to download a repository that is your starting point. 
<li> Pull - update FROM the online repository TO your computer. As I make updates to these workbooks throughout the term, you'll regularly (before every class usually) perform a pull, or grab any changes and update your machine. In the source control window, click the 3 dot icon in the header and choose Pull in the menu. 
<li> Commit and push - send changes FROM your computer TO the online repository. As you work on an assignment, you'll want to regularly commit as you progress. Each commit logs that version, and once it is pushed it is also backed up on the server. In the source control window, click the check mark logo, enter a note for the update, and press enter; then use Sync Changes (or Push in the 3 dot menu) to upload it. 
</ul>
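
The buttons in VS Code run Git commands behind the scenes, and the same commands can be typed into a terminal. The sketch below only prints a rough mapping and does not run anything. The repository link is a placeholder, and the commands listed for "Commit" are an approximation of what VS Code does when it stages all changes for you. Committing records a version on your computer; pushing is the step that uploads it.

```python
# Hypothetical repository link; replace it with the one posted for the course.
repo_url = "https://github.com/USERNAME/REPOSITORY.git"

# VS Code action -> roughly equivalent Git commands typed in a terminal.
git_equivalents = {
    "Clone Repository": [f"git clone {repo_url}"],
    "Pull": ["git pull"],
    "Commit": ["git add -A", 'git commit -m "Describe what changed"'],
    "Push": ["git push"],
}

for action, commands in git_equivalents.items():
    print(action)
    for command in commands:
        print("   ", command)
```

Seeing the terminal versions can help when searching online, since most Git examples are written as commands rather than as VS Code clicks.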

## Libraries and Installing Things

One thing you'll see all the time in Python, and programming in general, is "import" statements littered about, especially at the top of a program. These import statements grab libraries - packages of code that other people have written - and include them in our program so we can use them. Most of the stuff that we use isn't part of the core of Python itself; these are things that were written, shared, and reused over and over. Some of the common ones that we'll spend time with right away are:
<ul>
<li> Pandas - provides the dataframes that hold data. 
<li> Numpy - provides an assortment of math-like things, as well as arrays which we'll use later. 
<li> Seaborn - provides graphing and visualization functions. 
</ul>
There is a near infinite list of other ones that may be useful. We generally want to use these libraries when they exist (outside of doing something for the purposes of learning about it), as published packages are generally tested, optimized, and maintained, so they'll normally function better than whatever we can write. Anaconda packages many of the common libraries into one bundle, so for most things we can just add an "import whatever" statement and use it. There are lots of things that Anaconda doesn't package, so if we need one of those, we need to install it. This is usually a simple process, but may require some setup and config for your particular machine, especially if you're on Windows. 
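
As a small sketch of what importing and using a library looks like, here are two of the libraries above with their conventional short aliases. The numbers are made up for illustration.

```python
import numpy as np   # "np" is the conventional short alias
import pandas as pd  # "pd" is the conventional short alias

# A NumPy array and one of the math functions that comes with it.
values = np.array([1, 2, 3, 4])
print(values.mean())

# A pandas dataframe built from that array.
table = pd.DataFrame({"x": values, "x_squared": values ** 2})
print(table)
```

Seaborn is conventionally imported the same way, as `import seaborn as sns`.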

There are a few ways we can install things, from easiest to most complex:

#### Install via Magic Commands

We can write a special command in a code cell that basically sidesteps Python and runs a command on your underlying machine (that is what the ! at the front does). We can add one of these at the head of a program that needs to install stuff, and it will run the installation when you run the code. This is useful if you may be running code in different environments, such as running code in VS Code and Google Colab, as it will do the installation right up front. Again, there are two ways to install stuff here:
<ul>
<li> Install using conda. Anaconda has an installer that we can run. 
    <ul>
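    <li> !conda install -y PACKAGENAME (the -y answers "yes" to conda's confirmation prompt, which we can't easily respond to from inside a notebook)
    <li> E.g. to install pip - !conda install -y pip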
    </ul>
<li> Install using pip. Pip is the most common Python installer, and if you see examples online they will normally use pip. You MAY need to install pip first to use it. I normally use pip.
    <ul>
    <li> !pip install PACKAGENAME
    </ul>
</ul>
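
Another way to get the "install only if needed" behaviour is to do the check in plain Python. This is a minimal sketch under stated assumptions, not the only way to do it: ensure_installed is a made-up helper name, and it assumes pip is available in the environment running the code.

```python
import importlib.util
import subprocess
import sys


def ensure_installed(import_name, pip_name=None):
    """Install a package with pip only if it cannot already be imported.

    Returns True if an install was run, False if the package was already present.
    """
    if importlib.util.find_spec(import_name) is not None:
        return False
    # Calling pip through sys.executable targets the environment running this code.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or import_name]
    )
    return True


# "json" ships with Python, so nothing is installed here.
print(ensure_installed("json"))
```

Running pip through sys.executable helps avoid installing into the wrong environment by accident. In recent versions of Jupyter, `%pip install PACKAGENAME` has a similar effect.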

#### Install via Terminal Commands

We can also install things using the terminal. VS Code allows us to open a terminal inside the program; this is the same as if we opened a command prompt or terminal window on our computer. To open a terminal, go to the Terminal menu and choose New Terminal. Once at the command prompt, we can use the same commands as above, just without the ! at the front.

#### Install via Anaconda GUI

Packages can also be installed via the Anaconda GUI. This is the most user-friendly way to do it, but it is also the most manual and least repeatable so we won't focus on it much. The Anaconda documentation has a good guide on how to do it: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/

#### Install via Requirements File

We won't really do this, because it is more for production, but another common way to install needed stuff is to use a requirements file, which is just a text file with a list of needed libraries (and specific versions, if desired). An action can be set up to verify that all the requirements are met when testing/executing the code. For example, you could set up a repository on GitHub that runs a set of tests on your code when you check it in, and sets up an environment with all the requirements in the process of doing so. 
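
To make the idea concrete, here is a rough sketch of a requirements list and a check of which entries are installed. The package list is a placeholder (the last name is deliberately fake), check_requirements is a made-up helper, and it only looks at names, ignoring any version pins. In practice the file is usually named requirements.txt and installed with `pip install -r requirements.txt`.

```python
from importlib import metadata

# Hypothetical requirements.txt contents; names and versions are placeholders.
requirements_text = """
pandas
numpy
not-a-real-package-xyz==1.0
"""


def check_requirements(text):
    """Return {package: installed version, or None if missing}; ignores version pins."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name = line.split("==")[0].strip()
        try:
            results[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results


for name, version in check_requirements(requirements_text).items():
    print(name, "->", version or "MISSING")
```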

### Potential Issues

There are a few machine-specific things that can go wrong or weird with this. 

#### PATH Setup

The most common issue is a need to add something to the PATH. The PATH is roughly something that defines what programs you can run by name - i.e. if you were to open a terminal and type "Excel" and hit enter, MS Excel would probably open. If you type "conda/pip install whatever" and there's an error that it is an unknown command or similar, this is likely the issue. The solution is generally simple, you need to edit a value in your OS's configuration. Google "pip/conda add to path [my operating system version]" and there should be a multitude of examples online showing you what to add with screenshots and even recordings. 
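
Before editing anything, it can help to look at what is currently on the PATH. This is a minimal sketch using only the standard library; the three command names are just the ones discussed in this notebook, and the results depend entirely on your machine.

```python
import os
import shutil

# Each entry on the PATH is a folder the operating system searches for programs.
path_entries = os.environ.get("PATH", "").split(os.pathsep)
print(len(path_entries), "folders on the PATH")

# shutil.which returns the full path of a command, or None if it is not on the PATH.
for command in ["python", "pip", "conda"]:
    print(command, "->", shutil.which(command))
```

A result of None for pip or conda suggests that the folder containing that program needs to be added to the PATH.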

#### Environments

One thing to be attentive to in all this is the python environment. These installs are a per-environment thing, so an easy way to get confused is to install one in a different environment by accident. 

## Setup Tasks

<ol>
<li> Install pip. 
<li> Create a backup environment from the "base" one. 
<li> Do some stuff. 
</ol>