# Getting Started with the Jupyter Notebook (Part 2)
---

If you haven't already read part one of the [Getting Started with the Jupyter Notebook](https://charliesimms.medium.com/jupyter-notebooks-for-beginners-c98e7b829a02) tutorial, now would be a great time to check it out.


In part one, we briefly touched upon Markdown, the lightweight markup language used by data scientists and analysts to create nicely formatted Jupyter Notebooks. When using Markdown for the first time, you'll quickly recognize many of its formatting tools (like **bold** and *italics*) because they're the same ones found in word processors. The difference between those formatting options and the ones found in Markdown is that Markdown requires you to wrap the target text in tags consisting of asterisks, hashtags, and backticks. 

---

# A Rundown of Markdown
---

## Header Text

You can mark different sections of text by using different header levels. (In the past, you could also use the Jupyter Notebook _Header_ cell, but that feature has been deprecated). Markdown provides support for six header levels, which are denoted with one or more hashtag `#` characters, with the number of hashtags indicating the header level. For example, to make a first level header text called *Medium* followed by a second level header called *Jupyter Notebooks*, you would use the following:

```

# Medium
  
## Jupyter Notebooks

```

---

## Paragraphs

Paragraphs are just one or more lines that are enclosed within blank lines. If you need to insert a line break between two lines of text (which is useful for lists), just add two or more blank spaces at the end of the line to break.
  
```
Here is a simple Markdown formatted text that consists of several sentences. 

This demonstrates how normal text can be written in a Jupyter Notebook cell, which can provide additional information about a particular data science project.
```

---

## Formatting Text

If you want text to be _italicized_, you simply enclose it in either a single asterisk `*` character or single underscore `_` character. Similarly, for **bold** text, you enclose the text in two asterisks `**` or underscores `__`. My personal preference is to add asterisks; for example, the following Markdown text will first create **bold text** followed by *italicized text*:

```
first create **bold text** followed by *italicized text*
```

---

## Lists

You can create two types of lists in Markdown: unordered and ordered, both of which can be nested. To create a basic unordered list, all you need to do is prefix the list entries with a single asterisk `*`, dash `-`, or plus `+` character. To create a nested list, you simply indent the nested list by one or more spaces. Personally, I prefer using dash `-` characters to indicate unordered lists. For example, this formatted Markdown code

```
- Item 1
- Item 2
 - Item 2.1
 - Item 2.2
- Item 3
```

will produce the following list:

- Item 1
- Item 2
 - Item 2.1
 - Item 2.2
- Item 3

Ordered lists are basically the same, but are prefixed by numbers and a period, for example `1.` for the first item. The Markdown numbers used to indicate list items do not have to start at one, nor must they be sequential. For example, the text generated from the Markdown code below will be formatted to begin at one and proceed in order.

```
2. Item 1
4. Item 2  
1. Item 3
```

The code above will produce the following list:

2. Item 1
4. Item 2  
1. Item 3

---

## Displaying Formatted Code

One of the most helpful features of Markdown is its ability to include formatted code directly in text. Code can be inserted into text in two ways: inline and as a block. Inline code elements are wrapped in single backtick \` characters. For example, the Markdown \`print("Hello, Medium!")\` renders as `print("Hello, Medium!")`. A code block can also be placed on a single line with no backticks by indenting the line with four spaces:

```
    print("Hello, Medium!")
```

For longer code blocks, you can enclose the block in three backtick characters \`\`\`. For example, the following code block in Markdown:

```markdown
\`\`\`  
x = 3  
y = 4  
z = 5   
  
print("%a" % (x**2 + y**2 == z**2))
\`\`\`
```

will produce the following code block:  

```
x = 3  
y = 4  
z = 5   
  
print("%a" % (x**2 + y**2 == z**2))
```   

Note: Jupyter Notebooks also support GitHub-flavored Markdown, which lets you indicate the target programming language by appending the name of the language to the three opening backticks. For example, to indicate Python, you would use \`\`\`python, while to indicate JavaScript, you would use \`\`\`javascript. If we apply this to the previous example, it'll produce the syntax-highlighted version below:

```python  
x = 3  
y = 4  
z = 5   
  
print("%a" % (x**2 + y**2 == z**2))
```

---

## Quotes in the Notebook

You also can write quoted text by prefixing a line with a greater-than `>` character. You can also write multi-line block quoted text by prefixing every line with a greater-than `>` character. For example, the code below:  

```
> Here is the first line in a paragraph 
> that we're wrapping over multiple lines by inserting a line break  
> that's actually multiple space characters placed at the end  
> of a line.
```

will produce the following block of text:

> Here is the first line in a paragraph 
> that we're wrapping over multiple lines by inserting a line break   
> that's actually multiple space characters placed at the end 
> of a line.

---

## Mathematical Notation

It's possible to include math formulae in Markdown cells by using [LaTeX](https://www.latex-project.org/), a text formatting language that's heavily used in academia for lesson plans and scientific articles. LaTeX isn't the most intuitive typesetting system, but it is easy to use it to insert simple mathematical formulas into Jupyter Notebook cells.

The simplest approach to inserting LaTeX formulas into your cells is to wrap the expression in dollar signs, `$ ... $`, or, if you want the expression to appear on its own line, in double dollar signs, `$$ ... $$`. Many specific functions or math symbols are prefixed with a backslash character, `\`. For example, if you're a trigonometry professor who needs to write the lowercase Greek character theta, you would write `$\theta$`. There are many [helpful resources](https://en.wikibooks.org/wiki/LaTeX/Mathematics) for learning LaTeX freely available on the internet, and there are tools like [LaTeXit](https://pierre.chachatelier.fr/latexit/latexit-home.php) that can help you create and test LaTeX expressions on your own.

For example, the LaTeX expression from the first part of the tutorial

```
$\int_0^{\pi} \sin(\theta)\ d\theta = 2$
```

gets translated into 

$\int_0^{\pi} \sin(\theta)\ d\theta = 2$ 

in a Jupyter Notebook cell by MathJax.

LaTeX can also be used in code blocks to provide simpler or more descriptive plot labels (for example, theta versus $\theta$).

---

# Python Programming Language

---

## Programming Basics

Personally, I feel that computer programming is a difficult skill to learn without any guidance. When I first started programming in 2014, I was overwhelmed by the glut of freely available information on the internet. Remembering how intellectually impotent I felt back then, here's what I would say to someone who's learning to program:

1. Learn language-independent concepts before learning to use a particular language.
2. Learn what problems a programming language was designed to solve, and how it solves them.
3. Figure out what programming paradigm a particular language supports. (Few languages implement a paradigm 100%, but many facilitate programming in multiple paradigms.)

#### <font color ='red'>Note:</font> Feel free to scroll down to Python Basics if you want to jump right into programming.



### Language-Independent Concepts

At the most basic level, a computer program is composed of nothing more than data and instructions. When I say data, I mean numbers, characters, strings, Boolean values, etc., and instructions refer to the statements we write that control the program's decision-making process. Syntax concerns itself with how those instructions are formatted, while semantics deals with the logic of the instructions.

#### High-Level vs. Low-Level

In a broad sense, there are two categories of programming languages: low-level and high-level. A low-level language (like C++) resembles the numeric machine language of the computer more than the natural language of humans. A high-level language (like Python) is closer to the level of human-readability. 

![CodingLanguageLevels.jpeg](attachment:CodingLanguageLevels.jpeg)


#### Program Implementation
Interpretation and compilation are the two primary means of implementing a computer program.


In a **compiled** language (like C++), source code is first processed by a compiler, which translates the code into machine language, and that machine language directly controls the CPU.


In an **interpreted language**, the source code is not directly translated by a compiler. Instead, a different program, called the interpreter, reads and executes the code.


If you've ever heard someone complain about Python being a *slow* language, it's because running interpreted code is slower than running compiled code. With Python, the interpreter must analyze each statement in the program *every* time it's executed and then perform the instructed actions. With C++, the compilation process runs once and the execution time is set. If there's an error in the C++ source code, it won't compile. The same error in a Python program is discovered *while* the program is executing, which can slow the development process considerably.
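
To make the last point concrete, here is a minimal sketch of an error that Python only reports once the faulty line actually runs. The function name `describe` and the misspelled variable `lable` are made up for this illustration.

```python
def describe(value):
    """Return a short label for value; the else branch contains a typo."""
    if value >= 0:
        return "non-negative"
    else:
        # 'lable' is undefined, but Python only notices when this line runs.
        return lable


# The definition above is accepted without complaint.
first = describe(5)  # the buggy branch is never reached

try:
    describe(-1)  # now the buggy line executes
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

print(first)
print(error_name)
```

A compiler for a statically checked language would typically reject the equivalent program before it ever ran, which is the trade-off described above.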


#### <font color ='red'> Note:</font> A compiled language is often considered a low-level language, while an interpreted language is considered high-level. However, we should take care not to conflate levels of program abstraction with the methods of their implementation.


#### Syntax vs. Semantics

Let's look at some examples from the C++ and Python programming languages to illustrate the difference between syntax and semantics.

##### Syntax
Below, we have two code segments that produce the same output in their respective environments, the numbers 1 through 10 printed vertically. The syntactical difference between the languages should be apparent.

In the C++ programming language, we denote a block of a code with curly braces:

```C++
// C++ code
for (int i = 1; i <= 10; i++)
{
    // curly braces indicate code block
    cout << i << endl;
}
```
In Python, we denote a block of code with indentation:
```Python
for i in range(1, 11):
    # indentation indicates code block
    print(i)
```

Although both source code segments produce the same output, they are not executed in the same manner, and that's because each language has a different implementation.


#### Type Checking

Type checking is a form of program analysis that verifies something about the data types used in a program. When we talk about data types (or just *types*), we're referring to an attribute of data which informs the compiler or interpreter how it's going to be used in the program. For example, let's take another look at our C++ and Python source code above.

In the C++ code below, **int** is a data type in the C++ language, and we had to declare that the variable **i** was in fact an integer *before* we could assign a number to it. This is because C++ is a **statically**-typed language, where type-checking is done at compile time. In other words, you have to declare the data type of a variable before you can use it.

```C++
// C++ code
for (int i = 1; i <= 10; i++)
{
    cout << i << endl;
}
```
Now take a look at the Python code.
```Python
for i in range(1, 11):
    print(i)
```
As you can see, we did not have to assign a data type to the variable **i** before using it as an int. This is because Python is a **dynamically**-typed language, where type-checking is performed during interpretation, so you don't have to declare the data type of the variable before it's used. 

So, type-checking either occurs when a program is compiled (a static check) or interpreted (a dynamic check). 


If a particular programming language requires its implementation to strictly adhere to type-checking rules, it's referred to as a **strongly**-typed language. If it doesn't, it's considered a **weakly**-typed language. 
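
As a rough illustration of both ideas in Python, the sketch below rebinds one name to values of different types (dynamic typing) and then shows Python refusing to add a string to an integer (behavior usually described as strong typing). The variable names are arbitrary.

```python
# Dynamic typing: the same name can be bound to values of different types.
value = 10
first_type = type(value).__name__
value = "ten"
second_type = type(value).__name__

# Strong typing: Python refuses to silently mix unrelated types.
try:
    result = "1" + 1
except TypeError:
    result = None

# An explicit conversion states the intent and works.
converted = int("1") + 1

print(first_type, second_type, result, converted)
```

Running the cell prints `int str None 2`.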

--- 

## Python Basics

While Python is a relatively easy language to learn, there are some basics that should be reviewed before we dive deeper into this tutorial. To help reinforce the idea that good Python code should be easy to read, the Python community created some guidelines:

1. white space is important,
2. names should be descriptive,
3. code blocks are indented four spaces (not tabs) and follow a colon,
4. lines of code should be limited to less than 80 characters, and
5. good code should be thoroughly documented both with comments and descriptive documentation strings.

If lines are longer than 80 characters, best practice is to use parentheses to group operations and indentation to maintain readability. If, however, this is insufficient, a backslash, `\`, can be placed at the end of a line to extend code over as many lines as necessary.
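
Here is a small sketch of both continuation styles. The variable names are placeholders chosen for the example.

```python
first_value, second_value, third_value = 1, 2, 3

# Preferred: parentheses let the expression span several lines.
total_in_parens = (first_value
                   + second_value
                   + third_value)

# Fallback: a trailing backslash continues the line.
total_with_backslash = first_value \
    + second_value \
    + third_value

print(total_in_parens, total_with_backslash)
```

Both expressions evaluate to 6; the parenthesized form is generally easier to edit because a stray space after a backslash is a syntax error.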

---

### Python Identifiers

A Python identifier is a name that we give to identify variables, functions, classes, modules, or any other object, and it must follow these rules:

1. The first character must be a letter or an underscore character.
2. Variable and Function names traditionally start with a lowercase letter.
3. Classes traditionally start with an upper-case letter.
4. The identifier cannot be one of the reserved Python keywords (the standard library exposes them as `keyword.kwlist`).

Note: it's recommended to avoid reusing the names of built-in objects or objects from common Python libraries, like `str`, `list`, or `tuple`, to minimize name collisions and any resultant confusion.
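
The reserved keywords can be listed with the standard library's `keyword` module. The sketch below also defines a small helper, `is_usable_name` (a hypothetical name, not part of Python), that combines the first and fourth rules above.

```python
import keyword

# The full list of reserved words for the running Python version.
print(keyword.kwlist)


def is_usable_name(name):
    """Return True if name is a syntactically valid, non-reserved identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_usable_name("my_name"))  # True
print(is_usable_name("2fast"))    # False: starts with a digit
print(is_usable_name("class"))    # False: reserved keyword
```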

Python identifiers are case sensitive, so `myname` is different than `myName`. One of the easiest ways for software engineers to spot a beginning programmer is through meaningless variable and function names, so be sure to write descriptive identifiers. They're beneficial for code readability and subsequent maintenance, which is why we often write multi-word identifiers. When combining words for an identifier, you can use camel-case format, where each word *after* the first is capitalized, like `myNameSpace`. Alternatively, you can separate words with underscores, such as `my_name_space`. Regardless of which approach you use, it's best to be consistent with your identifier names.

The Python Enhancement Proposal, [PEP-8](https://www.python.org/dev/peps/pep-0008/), provides comprehensive documentation regarding the recommended best practices when writing Python code.

---

### Documentation

If you're new to computer programming, you should know that well-documented code is imperative to code reusability and maintainability, and the way we do that is through comments. Python supports two types of comments. The first type of comment is a single-line comment, which begins with a hashtag `#` and continues until the end of the line. (The `#` character can appear anywhere on the line.) Here's an example of single line comments - the first consists of the whole line, while the second starts after the command:

```python   
import math

# Calculate the sine of the number 3
x = math.sin(3) # Assume radians
```

The second type of comment is a multi-line comment, which begins and ends with either three single quote characters,  `''' comment text '''` , or three double quote characters in a row: `""" comment text """`. This comment can span multiple lines, which is why it's used in programs to provide documentation via implicit docstrings for functions and classes. Here's an example of a multi-line comment string:

```python   

'''
This comment block can serve as documentation
for a function, class, or module.

You can also utilize whitespace here 
to write more clearly.
'''
```
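
When a string like this is the first statement inside a function, Python stores it as that function's docstring. A minimal sketch, using a made-up function called `hypotenuse`:

```python
def hypotenuse(a, b):
    """Return the length of the hypotenuse of a right triangle.

    a and b are the lengths of the two shorter sides.
    """
    return (a ** 2 + b ** 2) ** 0.5


# The docstring is stored on the function object itself.
print(hypotenuse.__doc__)
print(hypotenuse(3, 4))
```

Calling `help(hypotenuse)` would display the same documentation text.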

The built-in `help` function can be used to view _docstring_ comments for different functions, classes, or other Python constructs, as seen in the code block below. As the appended comment suggests, you should execute this function and change the argument to the `help` function to view documentation for other built-in functions like `str` or `int`.

Note: another built-in function commonly used is the `print` function which can convert its arguments to a string and display the resulting string to *STDOUT* (which is generally the monitor).

---

In [66]:
help(sum) # Try changing sum to something different like str or int.

Help on built-in function sum in module builtins:

sum(iterable, /, start=0)
    Return the sum of a 'start' value (default: 0) plus an iterable of numbers
    
    When the iterable is empty, return the start value.
    This function is intended specifically for use with numeric values and may
    reject non-numeric types.



--- 

# Intro to Unix

If you're following along with the steps in this tutorial, the Jupyter Notebook will be running on your own server. If the accompanying .ipynb file for this tutorial could only be accessed from the cloud, the Jupyter system you would be using would be running on a (virtualized) Unix system. Additionally, since you'll need to read and write data, you'll be working directly with the Unix filesystem. Finally, most of us will have to use cloud systems, either explicitly or implicitly, in our (future) careers, so it's prudent to learn the basics of Unix in order to be a more proficient data scientist or analyst. 

The Unix operating system is a brilliant and pragmatic piece of technology that underpins many current operating systems, including both Linux and macOS. We'll quickly review basic Unix concepts:

- the Unix Shell,
- the Unix filesystem,
- file permissions, 
- how to work with directories and files,
- anonymous file downloads, and
- how to view the contents of files.

Note: prefixing shell commands in a code cell with an exclamation point (`!`) will execute commands from the *underlying* operating system, not the IPython kernel underpinning this notebook. For example, because I'm using Windows 10, running `! dir`, the **directory** command for the Windows command prompt, in a code cell is equivalent to running `ls`, the **list directory** command for Unix-based operating systems.

---

## Unix Shell

A typical [Unix](https://en.wikipedia.org/wiki/Unix) system provides a *command-line interface* (aka **CLI**) to facilitate communication between the programmer and underlying hardware. At first glance, a terminal appears much more intimidating to use than the graphical user interfaces we've grown accustomed to from Windows and macOS. Although not as visually appealing as a GUI, command lines have several benefits:

1. Powerful, comprehensive access to the underlying hardware
2. Standardized interactions between local and remote hardware
3. Running batch operations (called scripts) in seconds that would take much longer to complete using a GUI
4. The ability to chain commands together with the pipe (`|`) operator, so the output of one becomes the input of the next
5. It makes you seem like a tech wizard to friends and family

The standard command-line interface on a Unix system is provided by a program called a *shell*. While several different shells exist for Unix-like systems, one of the more popular ones is the [Bash shell](https://www.gnu.org/software/bash), which is what we'll use here.

While it isn't difficult to learn the programming language that controls the shell, an in-depth discussion of it is beyond the scope of this tutorial. However, we will go over a few of the basics of the shell language to help us understand what we're doing inside this Jupyter Notebook.

---

## Shell Syntax

### Variables

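In the Bash shell, a variable is assigned with `name=value` (no spaces around the `=`) and read back by prefixing its name with `$`. Double quotes still allow this substitution, while text in single quotes and a backslash-escaped `\$` are printed literally: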

In [67]:
%%bash

myvar="Hi there"

echo $myvar
echo "$myvar"
echo 'myvar'
echo \$myvar

Hi there
Hi there
myvar
$myvar


## The Unix Filesystem

The [Unix filesystem](https://en.wikipedia.org/wiki/Unix_filesystem) provides information storage and retrieval from the underlying hardware, as well as interprocess communication through a pipeline of input and output streams. The Unix filesystem is based on a single rooted tree model. The root of the tree is the __root__ directory, and is denoted by the `/` character. Sub-directories branch off from this root directory to form the rest of the filesystem's hierarchy.

Unix directories don't contain files, but rather the *names* of files that are paired with references to so-called **inodes**, which in turn contain both the file and its metadata (owner, permissions, access times, etc.). Multiple names in the file system hierarchy may refer to the same file; each such name is called a **hard link**.

Both files and directories have owners and groups. A special owner is known as *root*, or the superuser. If you have the proper privileges, you can run a command as the superuser by using the `sudo` command. Every entry within the file system has permissions that specify what the owner, the group, and everyone else may do with the file.
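
These ideas can be inspected from Python with `os.stat`. The sketch below assumes a Unix-like filesystem that supports hard links; it creates a file in a temporary directory, adds a second name for it with `os.link`, and checks that both names point at the same inode. The file names are arbitrary.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    original = os.path.join(tmp_dir, "original.txt")
    second_name = os.path.join(tmp_dir, "second_name.txt")

    with open(original, "w") as fout:
        fout.write("same data\n")

    # Create a second directory entry (a hard link) for the same inode.
    os.link(original, second_name)

    info_a = os.stat(original)
    info_b = os.stat(second_name)

    same_inode = info_a.st_ino == info_b.st_ino
    link_count = info_a.st_nlink
    # stat.filemode renders the permission bits the way `ls -l` does.
    permissions = stat.filemode(info_a.st_mode)

print(same_inode, link_count, permissions)
```

The permission string depends on your system's defaults, so its exact value will vary.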

---

## Unix System Commands

There are various commands that you can use to manipulate files and directories. The most commonly used ones include:    

### `ls`
This command is used to list the contents of a directory. The directory is supplied as a parameter, for example to list the contents of the root folder:

```bash
$ ls /
```

The `ls` command takes a number of different parameters, but two of the most commonly used ones are

- `-a` to list all files and directories. Any entry with a `.` or dot as the first character is by default hidden when listing the contents of a directory.
- `-l` to list the long format of each entry. This is useful to see all the permissions and owners of a directory or file.


### `cd`
This command is used to change the current working directory. If a directory is specified, we change to that directory, otherwise we change to the user's home directory. Directory names can be absolute (starting with the root directory, or `/`) or relative, where we use two `.` characters to signal the parent directory of the current directory (one `.` character represents the current directory):

```bash
$ cd /notebooks
$ cd ..
```

### `pwd`
This command is used to find out the name of the current working directory.

```bash
$ pwd
```

### `touch`
This command is used to make a new, empty file, with the name specified on the command line. For example to make a new, empty file called _myName_:

```bash
$ touch myName
```

### `mkdir`
This command is used to make a new directory, with the name specified on the command line. Note that this might require superuser privileges. For example, to make a new directory called _mytest_:

```bash
$ mkdir mytest
```

### `rmdir`
This command is used to remove an empty directory. Note that this might require superuser privileges. For example, to delete a directory called _mytest_:

```bash
$ rmdir mytest
```

### `rm`
This command is used to remove files or directories. To forcibly remove all entries (including non-empty directories) you can use the `-rf` flag. For example, to remove _myName_:

```bash
$ rm myName
```
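
For comparison, most of these commands have close counterparts in Python's standard library. The following is a rough sketch rather than a full replacement for the shell; it works inside a temporary directory so that nothing in your own folders is touched.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    base = Path(tmp_dir)

    (base / "mytest").mkdir()                 # mkdir mytest
    (base / "myName").touch()                 # touch myName
    after_create = sorted(os.listdir(base))   # ls

    (base / "myName").unlink()                # rm myName
    (base / "mytest").rmdir()                 # rmdir mytest
    after_remove = sorted(os.listdir(base))   # ls

print(os.getcwd())                            # pwd
print(after_create)
print(after_remove)
```

The two lists printed at the end are `['myName', 'mytest']` and `[]`.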

---

## Opening Files

Viewing the contents of a file is a rudimentary but nonetheless integral step to working with data. With a graphical interface, you might open Microsoft Word and then load a file into it to be edited. However, at the command line, we use a Unix command to open files for reading and to display file contents to `stdout` (which is usually your monitor).

Here are some useful commands for viewing files:

### `cat`

This command is used to view the entire contents of a file. For example, let's say we have a file called myName and want to send its contents to `stdout`, which in this case is the terminal display:

```bash
$ cat myName
```

### `less`

We use this command to view the contents of a file one screen at a time. `less` is a more recent version of the `more` command, which can also be used. For example, to page through the contents of myName (using the spacebar to go to the next screen, or the `b` key to go back one screen):

```bash
$ less myName
```

### `head`

This command is used to view a limited number of lines from the start (or head) of the file. By default, the first 10 lines will be displayed, but you can specify the exact number by using the `-n num` flag, where _num_ is the number of lines to display. For example, to display the first five lines from myName:

```bash
$ head -n 5 myName
```

### `tail`

This command is used to view a limited number of lines from the end (or tail) of the file. By default, the last 10 lines will be displayed, but you can specify the exact number by using the `-n num` flag, where _num_ is the number of lines to display. For example, to display the last three lines from myName:


```bash
$ tail -n 3 myName
```

Another option for the `tail` command is the `-f` flag, which displays the last lines of a file that could be continually updated (e.g., the output of another command).
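
The behavior of `head` and `tail` can be approximated in a few lines of Python. This is only a sketch: the functions `head` and `tail` below are written for this example, and the sample text is generated in memory instead of being read from a real file.

```python
import io
from collections import deque
from itertools import islice


def head(lines, num=10):
    """Return the first num lines of an iterable of lines."""
    return list(islice(lines, num))


def tail(lines, num=10):
    """Return the last num lines of an iterable of lines."""
    # A deque with maxlen keeps only the most recent num items.
    return list(deque(lines, maxlen=num))


sample = io.StringIO("".join(f"line {i}\n" for i in range(1, 21)))
first_five = head(sample, 5)
sample.seek(0)  # rewind before reading again
last_three = tail(sample, 3)

print(first_five)
print(last_three)
```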

-----

# File I/O with Python

---

We're going to build on the foundation provided by the previous tutorial to introduce how to read data from, and write data to, files. This is an important skill since we often want to share our results or will need to rerun analyses, both of which are made considerably easier when the data can be reused.

File manipulation is relatively simple with a graphical user interface like the ones in Microsoft Windows or macOS. However, file operations within the context of a programming language are a little more complicated. 

When working with files, or any other [system object](https://docs.microsoft.com/en-us/dotnet/api/system.object?view=netcore-3.1), we must be judicious with our management of the underlying resource. For this tutorial, that means a file and the file descriptor (or file handle) that the host operating system uses to reference the file. While today's operating systems can typically manage a large number of open files, when we use virtualization, as would be the case if you were hosting a Jupyter Notebook in the cloud, we want to minimize the resources demanded of the server.

So, we're going to learn how to properly open a file and write data into it. After that, we'll learn how to read and write data into delimited text files (e.g., comma separated value, or CSV, files). Lastly, we'll learn more about using Python modules or packages which can be used to add or extend the functionality of your Python programs. 

We'll start by running a Bash shell script to create a new directory for our data files, assuming it hasn't already been created.

---

In [68]:
%%bash

# We're creating an absolute file path for our Bash script
# DIR=/home/medium_blog/data

# Below is the relative file path for Bash script
DIR=data

# Below, the -d test evaluates to true if the path
# already exists and is a directory. If not,
# we create it.
if [ ! -d "$DIR" ] ; then
    mkdir "$DIR"
fi

We can create a new file object by using the built-in `open` function. We open the file (or create it if it doesn't already exist) by specifying its name and the *mode* in which we want to open it.

Here are some of the available modes:

| Mode | Description                       |
| ---- | --------------------------------- |
| 'r'  | reading (default)                 |
| 'w'  | writing, truncate file first      |
| 'x'  | create new file, fail if exists   |
| 'a'  | writing, append to file if exists |
| 'b'  | binary mode                       |
| 't'  | text mode (default)               |
| '+'  | open for reading and writing      |

Notice how *r* and *t* have default in parentheses? That's because `rt` (reading text) is the default mode. So, to open a text file named `medium.txt` for writing, you'd use `x = open('medium.txt', 'w')`. After you've performed all the necessary operations, you would use `x.close()` to close the file and relinquish the associated resources.

In Python, file input and output employs a runtime [context](https://docs.python.org/3/reference/datamodel.html?highlight=context%20manager#with-statement-context-managers). According to the documentation, this means, "*The context manager handles the entry into, and the exit from, the desired runtime context for the execution of the block of code*." Loosely translated, this means that a context manager is a way to enforce what should happen when a code block is entered and exited. The *context* itself is created upon invocation of the `with` statement; the expression that follows `with` is what creates the actual context (which is what manages entry to, and exit from, the enclosed code block). For this tutorial, we're not going to use a context for anything but opening and closing files. As seen in the next code block, we can now open a file, work on it, then close it without having to worry about any resource management issues (thanks to the context).
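
To see how a few of these modes differ in practice, here is a small sketch that writes to a throwaway file in a temporary directory. The file name `medium.txt` is reused from the text; everything else is illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "medium.txt")

    x = open(path, "w")   # create (or truncate) the file and write
    x.write("first\n")
    x.close()

    x = open(path, "a")   # append to the existing contents
    x.write("second\n")
    x.close()

    x = open(path)        # no mode given, so 'rt' is used
    after_append = x.read()
    x.close()

    x = open(path, "w")   # 'w' discards what was there before
    x.write("third\n")
    x.close()

    x = open(path, "r")
    after_truncate = x.read()
    x.close()

    try:
        open(path, "x")   # exclusive creation fails: file exists
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

print(repr(after_append), repr(after_truncate), exclusive_failed)
```

Running this prints `'first\nsecond\n' 'third\n' True`.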

---


In [69]:
# Example of writing to a file

# An example of an absolute file path
# out_file = '/home/medium_blog/data/temp.txt'

# A relative file path
out_file = 'data/temp.txt'

# We add a newline character at the end of each string
with open(out_file, 'w') as fout:
    fout.write("Hello, Medium!\n")
    fout.write("Goodbye, Medium!\n")

In [70]:
# Example of reading a file
# Traditional method

f = open("data/temp.txt", "r")
print(f.read())
f.close()  # without a context, we must close the file ourselves

Hello, Medium!
Goodbye, Medium!



In [71]:
# Example of reading a file
# With method
with open('data/temp.txt', 'r') as fin:
    for line in fin:
        print(line)

Hello, Medium!

Goodbye, Medium!
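
To see what a context does on entry and exit, it can help to write a toy one. The class `Announce` below is invented for this illustration; it only records when its block is entered and left. The second half of the sketch checks that a file opened in a `with` block is closed once the block ends.

```python
import tempfile


class Announce:
    """A toy context manager that records block entry and exit."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even if the block raised an exception.
        self.log.append("exit")
        return False  # do not suppress exceptions


events = []
with Announce(events):
    events.append("inside block")

print(events)

# File objects follow the same protocol: leaving the block closes them.
with tempfile.TemporaryFile("w+") as handle:
    handle.write("Hello, Medium!\n")
closed_after_block = handle.closed

print(closed_after_block)
```

The first `print` shows `['enter', 'inside block', 'exit']` and the second shows `True`.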



---

## Remote Data Files

If you're reading this tutorial, it's likely that you'll want to work with data that's been created by others. To demonstrate working with externally produced data, we're going to obtain a list of S&P 500 companies from a remote website.

The first code cell provides the name of the file where I'm going to store data locally, on my own computer. The second code cell is a Bash script that first tests if the file exists locally on my Jupyter server, and if not, it uses the `wget` command to pull the file off someone else's server and have it sent to my server. Finally, we'll use the Unix `head` command to display the first ten lines of the file to verify that it's been retrieved successfully and to see the format of each row. In this case, the file employs commas to separate values from each other within a single row. This format is known as comma separated value, or CSV, and is a popular text format.

In [72]:
# We first name the file that contains our data of interest
data_file ='data/stonks.csv'

In [73]:
%%bash -s "$data_file"

# Note: we passed in a Python variable above to the Bash script, 
# which is then accessed via positional parameter, or $1 in this case.

# First test if file of interest does not exist
if [ ! -f "$1" ] ; then

    # If it does not exist, we grab the file from the Internet and
    # store it locally in the data directory

    wget -O "$1" https://datahub.io/core/s-and-p-500-companies/r/constituents.csv

else
    
    echo "File already exists locally."
fi

File already exists locally.
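
The same "download only if missing" logic can also be sketched in pure Python with the standard library's `urllib.request`, which is a different tool from the `wget` command used above. The function name `fetch_if_missing` is hypothetical, and the call is left commented out because it needs network access.

```python
import os
import urllib.request


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False otherwise.
    """
    if os.path.isfile(dest):
        print("File already exists locally.")
        return False
    urllib.request.urlretrieve(url, dest)
    return True


# Usage (requires network access), mirroring the Bash cell above:
# fetch_if_missing(
#     "https://datahub.io/core/s-and-p-500-companies/r/constituents.csv",
#     "data/stonks.csv",
# )
```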


In [74]:
%%bash

# File open, method #1
# Display 1st 10 lines
head -10 data/stonks.csv

Symbol,Name,Sector
MMM,3M Company,Industrials
AOS,A.O. Smith Corp,Industrials
ABT,Abbott Laboratories,Health Care
ABBV,AbbVie Inc.,Health Care
ABMD,ABIOMED Inc,Health Care
ACN,Accenture plc,Information Technology
ATVI,Activision Blizzard,Communication Services
ADBE,Adobe Inc.,Information Technology
AAP,Advance Auto Parts,Consumer Discretionary


In [75]:
# File open, method #2
f = open("data/stonks.csv", "r")
print(f.read(346))
f.close()  # release the file once we're done with it

Symbol,Name,Sector
MMM,3M Company,Industrials
AOS,A.O. Smith Corp,Industrials
ABT,Abbott Laboratories,Health Care
ABBV,AbbVie Inc.,Health Care
ABMD,ABIOMED Inc,Health Care
ACN,Accenture plc,Information Technology
ATVI,Activision Blizzard,Communication Services
ADBE,Adobe Inc.,Information Technology
AAP,Advance Auto Parts,Consumer Discretionary
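
Counting characters (346 in the cell above) is fragile, because the number changes whenever the data does. One possible alternative is to count lines instead, using `itertools.islice`. The `first_lines` helper and the throwaway sample file below are assumptions made for this sketch; the rows are copied from the output above.

```python
import tempfile
from itertools import islice
from pathlib import Path


def first_lines(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Demonstration on a small throwaway file
sample = Path(tempfile.mkdtemp()) / "sample.csv"
sample.write_text(
    "Symbol,Name,Sector\n"
    "MMM,3M Company,Industrials\n"
    "AOS,A.O. Smith Corp,Industrials\n"
)

for line in first_lines(sample, 2):
    print(line)
```

With the real file, `first_lines(data_file)` should return the same ten lines that `head -10` displays.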



---

### Delimited Data

S&P 500 data is well suited to spreadsheets: our file has one company per row, and the columns (or fields) in each row are separated by commas. This file format is known as a comma-separated values (CSV) file, and most spreadsheet programs can export data to this format. In this case, a comma is used as the **delimiter** between the fields or columns in the file, but other delimiters (such as tabs or semicolons) can also be used.
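
To make the idea of a delimiter concrete, here is a small sketch that parses the same row twice, once comma-delimited and once tab-delimited, using the `delimiter` parameter of `csv.reader` from the Standard Library. The in-memory strings are stand-ins for real files.

```python
import csv
import io

# The same record, written with two different delimiters
comma_text = "ACN,Accenture plc,Information Technology\n"
tab_text = "ACN\tAccenture plc\tInformation Technology\n"

# csv.reader splits on commas by default; delimiter= overrides that
comma_row = next(csv.reader(io.StringIO(comma_text)))
tab_row = next(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(comma_row == tab_row)
print(tab_row)
```

Both calls produce the same list of three fields, so the rest of a program doesn't need to care which delimiter the file used.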

Now that we have the **stonks** CSV file, we can read the data and process it accordingly. In the following code cell, the file is opened, and a list of companies in the IT sector is displayed in a specific format. To pull this off, we open the file using a context manager (the `with` statement) and assign the file to the variable `fin` (*file input*). Python treats this file input as an iterator, which allows us to access one line of text at a time, with the newline character (`\n`) denoting the end of each line. Each line of text is returned as a [string](https://www.w3schools.com/python/python_strings.asp#:~:text=Strings%20are%20Arrays,access%20elements%20of%20the%20string.), which is held in the `line` variable. We split the `line` string on our delimiter (which is a comma) into columns that are held in the `cols` list.

In this particular example, we check whether the string `Information Technology` is in the third column (i.e., `cols[2]`, because Python lists are indexed from zero), and if so we print a nicely formatted string naming each company, its ticker symbol, and its [GICS sector](https://www.msci.com/gics). The format string is specified at the top of the code cell and enables variable substitution by using curly braces to indicate where the new text should be inserted for each row (i.e., `{0}` indicates where the first argument should be inserted into the string, and so on). The output shown after the code cell displays the data generated by running this script.


In [76]:
# We're using Python to read the file

# Here is our formatted print string
fString = "A company named {0} with the ticker symbol {1} is in the {2} sector."

print("Displaying Information Technology Companies")
print(120*'-')

# Now loop through the file, and display any company in IT sector
# Each line is read in from the file as a Python string, which we tokenize
# (or split) on commas into a list of columns. We can extract specific columns to
# get the data of interest.

with open(data_file, 'r') as fin:
    for line in fin:
        # Strip the trailing newline so it doesn't end up in the last column
        cols = line.strip().split(',')
        if 'Information Technology' in cols[2]:
            print(fString.format(cols[1], cols[0], cols[2]))


Displaying Information Technology Companies
------------------------------------------------------------------------------------------------------------------------
A company named Accenture plc with the ticker symbol ACN is in the Information Technology sector.
A company named Adobe Inc. with the ticker symbol ADBE is in the Information Technology sector.
A company named Advanced Micro Devices Inc with the ticker symbol AMD is in the Information Technology sector.
A company named Akamai Technologies Inc with the ticker symbol AKAM is in the Information Technology sector.
A company named Amphenol Corp with the ticker symbol APH is in the Information Technology sector.
A company named Analog Devices Inc. with the ticker symbol ADI is in the Information Technology sector.
A company named ANSYS with the ticker symbol ANSS is in the Information Technology sector.
A company named Apple Inc. with the ticker symbol AAPL is in the Information Technology sector.
A company named Applied 
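
Splitting on commas by hand works for this file, but it breaks as soon as a field contains a comma of its own. As a hedged alternative, here is a sketch that uses `csv.DictReader` from the Standard Library, which respects quoted fields and lets us refer to columns by header name. The sample rows are held in memory, and the `XMPL` / `Example, Inc.` row is made up purely to show the quoting behavior.

```python
import csv
import io

# In-memory sample; the last row is fictional and contains a quoted comma
sample = io.StringIO(
    'Symbol,Name,Sector\n'
    'ACN,Accenture plc,Information Technology\n'
    'ABT,Abbott Laboratories,Health Care\n'
    'XMPL,"Example, Inc.",Information Technology\n'
)

template = "A company named {0} with the ticker symbol {1} is in the {2} sector."

# DictReader uses the header row as keys, so no column indexes are needed
it_rows = [
    row for row in csv.DictReader(sample)
    if row["Sector"] == "Information Technology"
]

for row in it_rows:
    print(template.format(row["Name"], row["Symbol"], row["Sector"]))
```

To run this against the real data, the `io.StringIO` object could be replaced with a file opened via `open(data_file, newline='')`.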

---
## Python Modules & Packages

The more popular a programming language or web application framework becomes, the more individuals and organizations invest time and money into developing applications for it. Thanks to [Guido van Rossum](https://en.wikipedia.org/wiki/Guido_van_Rossum), Python supports encapsulating code into [modules](https://docs.python.org/3/tutorial/modules.html), which are just files consisting of definitions for functions, classes, or variables. *Modules* can be imported into other Python files, which means these definitions can be reused. (Code reusability is essential to good software engineering.)
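
To see that a module really is just a file of definitions, here is a small sketch that writes one to a temporary directory and then imports it. The module name `greetings` and its contents are hypothetical, chosen only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny module file containing one variable and one function
module_dir = Path(tempfile.mkdtemp())
(module_dir / "greetings.py").write_text(
    'DEFAULT_NAME = "world"\n'
    '\n'
    'def greet(name=DEFAULT_NAME):\n'
    '    return f"Hello, {name}!"\n'
)

# Make the directory importable, then import the module like any other
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
greetings = importlib.import_module("greetings")

print(greetings.greet())
print(greetings.greet("Jupyter"))
```

In everyday use you would simply save `greetings.py` next to your notebook and write `import greetings`; the temporary directory is only there to keep the example self-contained.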

What happens when a project grows beyond a single file? Related modules are grouped together into a Python **package**. Importing packages (or modules) into other Python programs is as simple as using the `import` statement, which takes multiple forms:

1. `import numpy`
2. `import numpy as np`
3. `from numpy import linspace`
4. `from numpy import *`


The first import statement makes the whole numpy package available to the current program, but leaves all of its names in the numpy [namespace](https://en.wikipedia.org/wiki/Namespace#:~:text=A%20namespace%20in%20computer%20science,associated%20only%20with%20that%20namespace.). This means that if you want to refer to a particular definition, like `linspace`, you have to use the `numpy` prefix, as in `numpy.linspace()`. The second import statement is similar to the first, but the prefix has been shortened to `np`. The third import statement imports only a single definition and brings it into the current namespace, which means it doesn't require a prefix. The last import statement pulls every public name from the _numpy_ package into the current namespace. Unfortunately, this increases the chance of name collisions, so this form is best avoided.
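
Here is a short sketch of the first three import forms and of a name collision. To keep it runnable without installing anything, I'm using the Standard Library's `math` module as a stand-in for numpy; the import mechanics are the same. The collision is shown with a single explicit import, and the wildcard form would do the same thing for every public name at once.

```python
import builtins
import math                 # form 1: names stay behind the "math." prefix
import math as m            # form 2: same module, shorter prefix
from math import sqrt       # form 3: one name, no prefix needed

# All three spellings refer to the same function object
same_function = math.sqrt is m.sqrt is sqrt

# A name collision: importing math.pow hides the built-in pow()
from math import pow

builtin_result = builtins.pow(2, 3)   # the original built-in returns an int
shadowed_result = pow(2, 3)           # the bare name now means math.pow (a float)

print(same_function)
print(builtin_result, shadowed_result)
```

The second line of output shows `8` next to `8.0`: the bare name `pow` quietly changed its meaning after the import, which is exactly the kind of surprise that prefixed imports prevent.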

Many popular packages are included with the standard Python distributions and are known collectively as the Standard Library. Other packages are available from third parties and can be very useful in specific circumstances. Here are some of the more popular Python packages for data science (all of them third-party except `csv`, which ships with the Standard Library):

### Data Processing

- [NumPy](https://numpy.org/) (Fast numerical arrays and matrices)
- [SciPy](https://www.scipy.org/) (Great for engineering and statistics)
- [Pandas](https://pandas.pydata.org/) (Data structures and tools for data analysis)
- [scikit-learn](https://scikit-learn.org/) (Machine learning tools)
- [csv](https://docs.python.org/3/library/csv.html) (Read and write CSV files)

### Data Visualization

- [matplotlib](https://matplotlib.org/) (Plotting library)
- [Seaborn](https://seaborn.pydata.org/) (Statistical data visualization built on matplotlib)


There are many more amazing packages for the Python ecosystem, so be sure to check out [PyPI](https://pypi.org/), the official repository for public Python packages. These libraries can usually be installed with [pip](https://pypi.org/project/pip/), the Python package management tool. Unfortunately, pip is beyond the scope of this tutorial. 

---

## Next Up in this Tutorial

We'll get our hands dirty with some Python packages, including:

- Pandas
- NumPy
- Matplotlib
- Seaborn

Hopefully, you're enjoying this beginner-level tutorial for Jupyter Notebooks, and you'll follow along as we learn more about data science!

Thank you for taking time out of your day to read this article! You can reach me at https://www.linkedin.com/in/charliesimms/.