## Software Packages

We will use a wide range of Python software packages. The following packages are already installed on [ds100.lsit.ucsb.edu](ds100.lsit.ucsb.edu), and we will use them routinely in lectures and homework:

### Linear Algebra

In [1]:
import numpy as np

### Data manipulation

In [2]:
import pandas as pd

### Visualization

In [3]:
import altair as alt 

# Starting with a Question: **Who are you (the students of DS100)?**

This is a pretty vague question, but let's start with the goal of learning something about the students in the class.


## Data Acquisition and Cleaning 

**In DS100 we will study various methods to collect data.**

To answer this question, I downloaded the course roster and extracted everyone's first name and major. I'll also assign everybody a random number (who will have the largest?!).

In [5]:
students = pd.read_csv("roster.csv", index_col=0)

# Every student gets a random draw from a standard normal distribution
students['Random Number'] = np.random.randn(students.shape[0])

students.tail()

| Name | Major | Random Number |
|---|---|---|
| STEPHANIE | STSDS | 1.143859 |
| FRANCIE | CMPSC | -1.246754 |
| REGINA | CMPSC | -1.353835 |
| YU | CMPSC | -0.024188 |
| KEVIN | STSDS | 1.07084 |
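The loading step can be tried without the real roster. The sketch below is a minimal illustration, assuming a hypothetical three-row CSV held in memory in place of `roster.csv`; the seed value and the use of `np.random.default_rng` are also assumptions, added so the random numbers repeat from run to run.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for roster.csv (the real file is not shown here).
csv_text = """Name,Major
NICK,STSDS
PRIYANKA,CMPSC
MEGAN,ACTSC
"""

# index_col=0 turns the first column (Name) into the row index.
roster = pd.read_csv(io.StringIO(csv_text), index_col=0)

# A seeded generator makes the draws repeatable across runs.
rng = np.random.default_rng(seed=100)
roster["Random Number"] = rng.standard_normal(len(roster))

print(roster)
```

Without a seed, every run of the notebook assigns different numbers, so the winner changes each time.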


In [6]:
students.index

Index(['NICK', 'PRIYANKA', 'MEGAN', 'ANITA', 'ALEC', 'DAVID', 'TINA', 'AARON',
       'SOPHIE', 'ROBERTO', 'ROBIN', 'CALVIN', 'HAN', 'STEPHANIE', 'JASMINE',
       'DANA', 'KARSYN', 'YAQI', 'CHENKAI', 'JIAQI', 'MEIYU', 'ANNE',
       'ADHITYA', 'ZIQING', 'MELINDA', 'BRANDON', 'MISHA', 'ARI', 'LIA',
       'MITCHELL', 'CATHERINE', 'LERON', 'SARAH', 'TAYLOR', 'RYDER', 'SHUYUN',
       'MENG', 'JIANGHUA', 'JUSTIN', 'PUCHUAN', 'QUANSEN', 'JONATHAN',
       'PHILIP', 'JASON', 'LINA', 'EDDIE', 'JESSICA', 'WENXUAN', '#REF',
       'THIHA', 'TRUNG', 'JULIA', 'NATHAN', 'XINQI', 'YUEHAN', 'JINGWEN',
       'REBECCA', 'CAMERON', 'MAX', 'ERAN', 'NATHAN', 'JOSH', 'STEPHANIE',
       'FRANCIE', 'REGINA', 'YU', 'KEVIN'],
      dtype='object', name='Name')

In [7]:
students.columns

Index(['Major', 'Random Number'], dtype='object')

## How do we select individual features (columns)?

There are a few ways.

In [10]:
students['Major'].head()

Name
NICK        STSDS
PRIYANKA    CMPSC
MEGAN       ACTSC
ANITA       ENVST
ALEC        STSDS
Name: Major, dtype: object
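As a rough sketch of those few ways, the example below selects the same column with brackets, attribute access, and `loc`, and also shows the list form that returns a data frame. The small frame and its `Score` column are hypothetical, not part of the roster.

```python
import pandas as pd

# Hypothetical three-row frame, not the real roster.
df = pd.DataFrame(
    {"Major": ["STSDS", "CMPSC", "ACTSC"], "Score": [1.0, 2.0, 3.0]},
    index=pd.Index(["NICK", "PRIYANKA", "MEGAN"], name="Name"),
)

by_brackets = df["Major"]      # Series; works for any column name
by_attribute = df.Major        # Series; only for names that are valid identifiers
by_loc = df.loc[:, "Major"]    # Series; label-based, all rows
as_frame = df[["Major"]]       # one-column DataFrame (note the list)

print(type(by_brackets).__name__, type(as_frame).__name__)
```

Bracket notation is the safest default because it also handles column names that contain spaces or clash with existing attribute names.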

---

## Exploratory Data Analysis

**In DS100 we will study exploratory data analysis and practice analyzing new datasets.**


### How many records do we have?
A good starting point is understanding the size of the data. 

#### Solution

In [11]:
print("There are", len(students), "students on the roster.")

There are 67 students on the roster.


### Is my data representative of the population I want to study?

**Answer:**

This is (or at least was) a complete **census** of the class containing all the official students.


### Understanding the structure of data

It is important that we understand the meaning of each field and how the data is organized.

In [12]:
students.sort_index(inplace=True)
students.head()


| Name | Major | Random Number |
|---|---|---|
| #REF | #REF | 0.67478 |
| AARON | STSDS | -0.727915 |
| ADHITYA | STSDS | 0.811819 |
| ALEC | STSDS | 1.155445 |
| ANITA | ENVST | 1.405671 |


What is the meaning of the **Major** field?

**Solution** 
Understanding the meaning of a field can often be achieved by looking at the types of data it contains (in particular, the counts of its unique values).

In [13]:
type(students['Major'])

pandas.core.series.Series

In [14]:
students['Major'].value_counts().to_frame()

| | Major |
|---|---|
| STSDS | 30 |
| CMPSC | 15 |
| ACTSC | 3 |
| STSCI | 3 |
| MTHSC | 3 |
| PRMTH | 2 |
| CMPEN | 2 |
| ECON | 1 |
| CHEME | 1 |
| MATCS | 1 |
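A small sketch of the same idea on made-up data: `value_counts` sorts values from most to least frequent, so rare or malformed entries collect at the bottom. The series below is hypothetical and deliberately includes one spreadsheet-error value.

```python
import pandas as pd

# Hypothetical majors, including one spreadsheet-error value.
majors = pd.Series(
    ["STSDS", "CMPSC", "STSDS", "#REF", "CMPSC", "STSDS"], name="Major"
)

counts = majors.value_counts()                # sorted by count, descending
shares = majors.value_counts(normalize=True)  # same order, as fractions of the total

print(counts)
print(majors.nunique(), "distinct values")
```

Scanning the tail of `counts` is a quick way to spot values that occur only once and may need a closer look.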


It appears that one student has an erroneous major given as "#REF". What else can we learn about this student? Let's see their name. We'll find this student using the `loc` indexer from pandas.

In [15]:
students['Major'] == '#REF'

Name
#REF          True
AARON        False
ADHITYA      False
ALEC         False
ANITA        False
ANNE         False
ARI          False
BRANDON      False
CALVIN       False
CAMERON      False
CATHERINE    False
CHENKAI      False
DANA         False
DAVID        False
EDDIE        False
ERAN         False
FRANCIE      False
HAN          False
JASMINE      False
JASON        False
JESSICA      False
JIANGHUA     False
JIAQI        False
JINGWEN      False
JONATHAN     False
JOSH         False
JULIA        False
JUSTIN       False
KARSYN       False
KEVIN        False
             ...  
MENG         False
MISHA        False
MITCHELL     False
NATHAN       False
NATHAN       False
NICK         False
PHILIP       False
PRIYANKA     False
PUCHUAN      False
QUANSEN      False
REBECCA      False
REGINA       False
ROBERTO      False
ROBIN        False
RYDER        False
SARAH        False
SHUYUN       False
SOPHIE       False
STEPHANIE    False
STEPHANIE    False
TAYLOR       False
THIHA  

In [16]:
students.loc[students['Major'] == "#REF"]

| Name | Major | Random Number |
|---|---|---|
| #REF | #REF | 0.67478 |
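The two cells above build a boolean mask and then pass it to `loc`. A minimal sketch of that pattern, on a hypothetical three-row frame, shows that the mask is an ordinary Series that can be counted before it is used as a filter.

```python
import pandas as pd

# Hypothetical frame with one bad record.
df = pd.DataFrame(
    {"Major": ["STSDS", "#REF", "CMPSC"]},
    index=pd.Index(["AARON", "#REF", "ALEC"], name="Name"),
)

mask = df["Major"] == "#REF"   # boolean Series aligned with df's index
n_bad = int(mask.sum())        # True counts as 1, so the sum is the number of matches
bad_names = df.loc[mask].index.tolist()

print(n_bad, bad_names)
```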


Though this single bad record won't have much of an impact on our analysis, we can clean our data by removing this record.

In [17]:
students = students.loc[students['Major'] != "#REF"]

Let's double check that our record removal only removed the single bad record.

In [18]:
students['Major'].value_counts().to_frame()

| | Major |
|---|---|
| STSDS | 30 |
| CMPSC | 15 |
| ACTSC | 3 |
| STSCI | 3 |
| MTHSC | 3 |
| PRMTH | 2 |
| CMPEN | 2 |
| ECON | 1 |
| CHEME | 1 |
| MATCS | 1 |
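Another way to double check a removal is to compare row counts before and after. The sketch below assumes a hypothetical four-row frame; the same two assertions could be applied to the roster.

```python
import pandas as pd

# Hypothetical frame with one bad record.
df = pd.DataFrame(
    {"Major": ["STSDS", "#REF", "CMPSC", "STSDS"]},
    index=pd.Index(["AARON", "#REF", "ALEC", "ANNE"], name="Name"),
)

bad = df["Major"] == "#REF"
cleaned = df.loc[~bad]          # ~ inverts the mask: keep everything that is not bad

# Exactly the flagged rows should be gone, and none should remain.
assert len(df) - len(cleaned) == bad.sum()
assert not (cleaned["Major"] == "#REF").any()

print(len(df), "->", len(cleaned))
```

A count comparison catches both failure modes: removing too many rows and removing too few.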


## Selecting Rows
We can select specific rows (observations) of a data frame using either `loc` or `iloc`. What's the difference?

In [29]:
## Same result
# display(students.iloc[[0, 5, 42]])
# print() ## Empty line
# display(students.loc[["AARON", "ARI", "PHILIP"]])

# Row position 3, column position 1: the fourth student's random number
students.iloc[3, 1]

1.4056712725844451

`iloc` selects by integer position (where `0` is the first row), and `loc` selects by index label. In addition, `loc` can select rows using a boolean array or Series.

In [31]:
## students['Major'] == "ACTSC"

students.loc[students['Major'] == "ACTSC"]

| Name | Major | Random Number |
|---|---|---|
| JESSICA | ACTSC | -0.290429 |
| JUSTIN | ACTSC | 0.214722 |
| MEGAN | ACTSC | -1.037759 |
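The practical difference between the two indexers shows up when the row order changes. In this sketch, on a hypothetical three-row frame, sorting moves rows to new positions while their labels stay attached.

```python
import pandas as pd

# Hypothetical frame in roster order.
df = pd.DataFrame(
    {"Major": ["STSDS", "ACTSC", "CMPSC"]},
    index=pd.Index(["NICK", "MEGAN", "ALEC"], name="Name"),
)

first_before = df.iloc[0].name           # label of the row at position 0
df_sorted = df.sort_index()
first_after = df_sorted.iloc[0].name     # position 0 now holds a different row

megan_before = df.loc["MEGAN", "Major"]  # label lookup is unaffected by sorting
megan_after = df_sorted.loc["MEGAN", "Major"]

print(first_before, first_after, megan_before, megan_after)
```

Code that relies on `iloc` positions should therefore be re-checked whenever the frame is sorted or filtered.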


Row index labels do not have to be unique:

In [32]:
students.loc["STEPHANIE"]

| Name | Major | Random Number |
|---|---|---|
| STEPHANIE | STSDS | 1.143859 |
| STEPHANIE | STSDS | -0.384954 |
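Repeated labels change what `loc` returns, so it can help to check for them up front. The sketch below uses a hypothetical three-row frame with one repeated name.

```python
import pandas as pd

# Hypothetical frame in which one name appears twice.
df = pd.DataFrame(
    {"Major": ["STSDS", "CMPSC", "STSDS"]},
    index=pd.Index(["STEPHANIE", "YU", "STEPHANIE"], name="Name"),
)

is_unique = df.index.is_unique
repeated = df.index[df.index.duplicated(keep=False)].unique().tolist()

# A repeated label yields a DataFrame; a label that occurs once yields a Series.
repeated_result = df.loc["STEPHANIE"]
single_result = df.loc["YU"]

print(is_unique, repeated)
print(type(repeated_result).__name__, type(single_result).__name__)
```

When every record must be addressable on its own, a unique key (for example a student ID, if one were available) is a safer index than first names.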


In [33]:
students.iloc[[0, 1, 2]]

| Name | Major | Random Number |
|---|---|---|
| AARON | STSDS | -0.727915 |
| ADHITYA | STSDS | 0.811819 |
| ALEC | STSDS | 1.155445 |


In [39]:
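students['Random Number']
# students.Random Number  # if uncommented, this raises the SyntaxError below: attribute access cannot handle a space in the column name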

SyntaxError: invalid syntax (<ipython-input-39-87441a7d76f7>, line 2)

### Summarizing the Data

We will often want to summarize the data numerically or visually. The `describe` method provides a brief, high-level description of the numeric columns of our data frame.

In [40]:
students.describe()

| | Random Number |
|---|---|
| count | 66.0 |
| mean | -0.090669 |
| std | 1.111918 |
| min | -2.806421 |
| 25% | -0.866446 |
| 50% | -0.157308 |
| 75% | 1.009611 |
| max | 1.76948 |
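Each row of the `describe` output corresponds to a statistic that can also be computed directly. The sketch below checks that correspondence on a hypothetical five-value series.

```python
import pandas as pd

# Hypothetical values chosen so the statistics are easy to verify by hand.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="x")
summary = values.describe()

# The labelled rows match the individual Series methods.
same_mean = summary["mean"] == values.mean()
same_median = summary["50%"] == values.median()
same_q1 = summary["25%"] == values.quantile(0.25)
same_std = summary["std"] == values.std()   # sample standard deviation (ddof=1)

print(summary)
print(same_mean, same_median, same_q1, same_std)
```

Note that `std` here is the sample standard deviation, which divides by `n - 1` rather than `n`.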


In [42]:
students.sort_values(by="Random Number", ascending=True)

| Name | Major | Random Number |
|---|---|---|
| SARAH | STSDS | -2.806421 |
| ERAN | CMPSC | -2.068628 |
| LERON | CMPSC | -2.005270 |
| YAQI | STSDS | -1.980737 |
| ANNE | MTHSC | -1.920084 |
| TRUNG | CMPSC | -1.736140 |
| MAX | CMPSC | -1.635002 |
| REGINA | CMPSC | -1.353835 |
| THIHA | STSDS | -1.310898 |
| JONATHAN | PRMTH | -1.261245 |
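Sorting in ascending order puts the smallest values first. To answer the earlier question of who has the largest number, one option is to sort in descending order, or to use `idxmax` or `nlargest`. The sketch below assumes a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical frame with made-up random numbers.
df = pd.DataFrame(
    {"Random Number": [0.3, -1.2, 1.7, 0.9]},
    index=pd.Index(["ANNE", "ARI", "DANA", "HAN"], name="Name"),
)

winner = df["Random Number"].idxmax()                   # label of the largest value
top_two = df.nlargest(2, "Random Number").index.tolist()
by_sort = df.sort_values(by="Random Number", ascending=False).index[0]

print(winner, top_two, by_sort)
```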


**In DS100 we will deal with many different kinds of data (not just numbers) and we will study techniques to analyze diverse types of data.**

**How can we summarize the name field?** A good starting point might be to examine the lengths of the strings. 

In [50]:
# Store the length of each name as a new column, then display it
students["Name Length"] = students.index.str.len()
students["Name Length"]

Name
AARON        5
ADHITYA      7
ALEC         4
ANITA        5
ANNE         4
ARI          3
BRANDON      7
CALVIN       6
CAMERON      7
CATHERINE    9
CHENKAI      7
DANA         4
DAVID        5
EDDIE        5
ERAN         4
FRANCIE      7
HAN          3
JASMINE      7
JASON        5
JESSICA      7
JIANGHUA     8
JIAQI        5
JINGWEN      7
JONATHAN     8
JOSH         4
JULIA        5
JUSTIN       6
KARSYN       6
KEVIN        5
LERON        5
            ..
MENG         4
MISHA        5
MITCHELL     8
NATHAN       6
NATHAN       6
NICK         4
PHILIP       6
PRIYANKA     8
PUCHUAN      7
QUANSEN      7
REBECCA      7
REGINA       6
ROBERTO      7
ROBIN        5
RYDER        5
SARAH        5
SHUYUN       6
SOPHIE       6
STEPHANIE    9
STEPHANIE    9
TAYLOR       6
THIHA        5
TINA         4
TRUNG        5
WENXUAN      7
XINQI        5
YAQI         4
YU           2
YUEHAN       6
ZIQING       6
Name: Name Length, Length: 66, dtype: int64
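With the lengths in hand, a frequency table gives a compact numeric summary. This is a minimal sketch on five names taken from the roster; the variable names are illustrative only.

```python
import pandas as pd

# A handful of names from the roster, used here as a small example.
names = pd.Index(["AARON", "ARI", "YU", "ANNE", "ALEC"], name="Name")
lengths = pd.Series(names.str.len(), index=names, name="Name Length")

# How many names have each length, ordered by length rather than by count.
length_counts = lengths.value_counts().sort_index()

shortest = lengths.idxmin()
longest = lengths.idxmax()

print(length_counts)
print(shortest, longest)
```

The table produced by `value_counts().sort_index()` holds the same numbers that a bar chart of counts per name length would display.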

## Visualizing name length in Altair

In [51]:
# pseudocode: alt.Chart(<pandas DataFrame>).mark_<chart type>().encode(...)


alt.Chart(students).mark_bar().encode(
     x = "Name Length",
     y = "count()"
)