# Week 3: Implementing Data Analysis Tools

## 3.1 Need for analysis tools

For simple data organization, spreadsheet software such as Microsoft Excel is sufficient. However, the data handled in data science is generally large, and manipulating it with a mouse is time-consuming. In addition, Excel's data analysis and graphics functions are limited compared with dedicated analysis tools.

Therefore, it is necessary to become familiar with tools other than spreadsheets. Now that data science and AI are widely used, there are many options for data analysis tools. First, SPSS, SAS, and MATLAB are well-known as paid software.

On the other hand, R and Python are among the most popular free tools in the world. Online tools include Google Colaboratory, Microsoft AzureML, and Tableau. These online tools can be used free of charge, with some limitations, and since they do not require installation on a computer, they are easy to try out.

In this class, we will use analytical tools to provide a more intuitive understanding of data science theory and methods. Through the tools, students will actually manipulate data to summarize data and create graphics. Students will also develop hypotheses about the data, and actually go through the process of testing these hypotheses.

## 3.2 R and Python

As mentioned in the previous section, operating with a mouse is inconvenient when dealing with large data. The alternative is to use a **programming language**. In the data science field, there are programming languages focused on statistical analysis, among which R and Python are widely used.

[R](https://www.r-project.org/) is a programming tool dedicated to statistical analysis and graphics creation. It is now often used together with [RStudio](https://rstudio.com/), an IDE (integrated development environment, a tool that makes programming easier).

Python, on the other hand, is a programming language that can be used for general purposes, but in the past few years, libraries (extensions that can be added later) focused on data analysis have become more extensive. In particular, Python provides an environment for easy analysis of large-scale data in deep learning.

In this lecture, exercises using Python will be conducted in parallel. Therefore, students are required to do either of the following:
1. set up a Python environment on your own computer
2. use Google Colaboratory (Google account required)

The instructor of the lecture class will tell you whether to choose 1. or 2.

First, we will explain how to install Python on your own computer. Next, we will explain how to use Google Colaboratory. Again, please follow the instructions of your class instructor to choose either 1. or 2. In the following, we will explain both so that you will be prepared for either case. It is not necessary to try both of the following on your own.

## 3.3 1. Installation of Python

There are multiple ways to install Python. The method also depends on the operating system (Windows, Mac, or Linux).

In this class, it is assumed that the following environment is installed on your computer:
1. Python version 3.6 or 3.7 is installed.
2. numpy, statsmodels, statistics, pandas, scipy, scikit-learn, matplotlib, and seaborn are installed as additional libraries.
3. jupyter as a tool for writing commands and displaying graphs (however, if you are familiar with other editors, jupyter is not necessary).

If you already have the environment described in 1, 2, and 3 above, or if you can set up your own environment, you may skip the following introduction.
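One way to confirm that the environment above is in place is to ask Python itself. The following is a minimal sketch and not part of the original course material: the helper name `check_environment` is hypothetical, and the list uses `sklearn` because that is the name under which scikit-learn is imported.

```python
import sys
from importlib.util import find_spec

# Import names of the libraries listed above.
# Note: scikit-learn is imported under the name "sklearn".
REQUIRED = ["numpy", "statsmodels", "statistics", "pandas",
            "scipy", "sklearn", "matplotlib", "seaborn"]


def check_environment(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


print("Python version:", sys.version_info.major, sys.version_info.minor)
for name, found in check_environment(REQUIRED).items():
    print(f"{name:12s}", "OK" if found else "missing")
```

Any library reported as missing can be installed afterwards, so a "missing" line here is not an error in itself.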

### 3.3.1 Easiest way

- Installing [Anaconda](https://www.anaconda.com/) will provide an environment for Python, additional data analysis libraries, and jupyter.
Note that the following are screenshots of the site, and the design of the site may have changed by the time you actually access it.

![](../images/anaconda_install.jpg)
![](../images/anaconda_install2.jpg)

The Anaconda home page has a Register link, but you do not need to register as a user (of course, you may register as a user).

For Windows, download the Python 3.7 installer for Windows (named Anaconda3-2020*-Windows-x86_64.exe or similar) using the Download button in the upper right corner of the site (as of April 2020), and double-click it (Graphical installer is not required). On Windows, it is better to choose **All Users** as the installation type during setup.

![](../images/anaconda_install3.jpg)

If you search the Internet, you will find many examples that instruct you to choose Just Me (recommended) (even the home page says so). However, Japanese users often have a user name (login account name) written in Japanese characters. In that case, Python may not work properly.

If you do choose **All Users**, however, you need to right-click on the icon, select "More" and then "Run as administrator" when launching Anaconda applications.

Conversely, on a Mac, it is better to choose "Install for Personal Environment". This is to avoid conflicts because the Mac system has Python2 by default.

Furthermore, on Windows, it is safer to leave the "Add Anaconda to the system PATH environment variable" checkbox unchecked (checking it works in many cases, but it can cause problems depending on the setup). Below that, make sure "Register Anaconda as a system Python" is checked.

![](../images/anaconda_install4.jpg)

Finally, you will be asked to install PyCharm, which is not particularly necessary; PyCharm is an IDE, a programming aid.

If you want more detailed instructions, you can search Google for "Anaconda windows10 installation" or "Anaconda Mac installation," but you should try to find as much up-to-date information as possible by selecting "within 1 year" from the "Tools" menu under the search window.

## 3.4 Starting Jupyter

Select [Anaconda]->[jupyter notebook] from the Windows Start menu. **However, if you selected All Users when installing Anaconda on Windows, right-click on the icon named Jupyter and run it as administrator, as follows.**

![](../images/anaconda_install5.jpg)
![](../images/anaconda_install6.jpg)

When asked to specify a browser, choose Firefox, Chrome, or Edge.

Incidentally, when launching the command prompt (or terminal), also do the following. The command prompt is used to install additional libraries for Anaconda, for example.

Right-click on the Anaconda Prompt icon and run as administrator. In Python, you can install additional libraries by doing the following at the command prompt (terminal).

![](../images/anaconda_install7.jpg)

pip install janome

and press Enter.

The software installed here is Janome, a morphological analyzer (software that breaks Japanese sentences into words).

On a Mac, Anaconda-Navigator is installed in the "Applications" folder, so start it and launch Jupyter.

![](../images/jupyter_mac1.png)

To exit Jupyter, press Quit in the upper right corner of the Jupyter side of the browser. Or, press and hold down the Control key and press the letter "c" twice on the black terminal (command prompt) that appears at the same time as the startup.

Select [Python3] from the [New] menu on the right side of the browser screen.

Type `import sys`, then start a new line and type `sys.version`. Now hold down the Shift key and press Enter, and you should see the result below.

![](../images/Anaconda.png)

For more information on how to use Jupyter, please refer to the following sites.

[External site: 図解！Jupyter Notebookを徹底解説！(インストール・使い方・起動・終了方法)](https://ai-inter1.com/jupyter-notebook/)

## 3.5 2. About Google Colaboratory

Google Colaboratory is a ready-to-use Python environment that requires only a Google Account and a browser, and requires no installation. See below for details (use the arrow keys in the lower right corner).

[What is Google Colab](https://infoart.ait231.tokushima-u.ac.jp/DS/GoogleColab.html#/)

## 3.6 About Python

Programming languages are broadly divided into compiled languages and scripting languages. Compilation means converting written instructions into machine language. Machine language is practically unreadable for most people (although a few can read it), but the computer understands it and can execute the instructions at high speed. To convert a file of written instructions (called a source file) into machine language, software called a compiler is needed. In C, for example, a compiler translates a source file written by a human into machine language, and the computer executes the translated file. Java also uses a compiler, although it produces bytecode for a virtual machine rather than machine language for the processor itself.

On the other hand, in the scripting system, a file with written instructions can be executed by a computer as is. Although the execution speed is inferior to machine language, it is easy to try out the instructions. A file with written instructions is called a script. Python, Perl, Ruby, and R are script programming languages.

Scripting languages are said to be easier to write than compiled languages. For example, compiled languages strictly distinguish whether a number is a real number (`double` or `float`) or an integer (`int`). In C, to treat 2 as an integer you write `int x = 2;`, and to store a real number (floating-point type) you write `double pi = 3.14159265359;` or `float pi = 3.14159265359f;` (assignment is explained in Video 1 below).

In scripting languages, on the other hand, you do not need to be so conscious of "type" when performing assignments; if X=2, whether it is an integer or a real number is often determined by the context (the relationship before and after the instruction).
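As a small illustration of this point (a sketch, with variable names chosen only for the example), Python's built-in `type()` function reports the type that was inferred from the assigned value:

```python
# No type declaration is needed; Python infers the type from the value.
x = 2        # written without a decimal point -> integer
y = 2.0      # written with a decimal point -> real number (float)

print(type(x))   # <class 'int'>
print(type(y))   # <class 'float'>

# The same name can later refer to a value of a different type.
x = "two"
print(type(x))   # <class 'str'>
```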

The following figure shows an example of assigning 10 to `x` and dividing by 3 in C and Python.

![](../images/int_c.png)

In C, the type of a variable is important, and here the type is specified as int (integer). Therefore, the result of dividing by 3 is also an integer (the fractional part is discarded).

In Python, on the other hand, even if an integer is assigned to `x`, the result of the operation (division) is displayed as a real number.

![](../images/python_int.png)
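The Python behaviour shown in the figure can be reproduced as follows. This is a minimal sketch; the `//` and `%` operators are included only for comparison, because for positive numbers `//` gives the same whole-number result as the C example.

```python
x = 10

true_division = x / 3     # "/" always produces a real number (float)
floor_division = x // 3   # "//" discards the fractional part
remainder = x % 3         # "%" gives the remainder of the division

print(true_division)      # a float slightly above 3.33
print(floor_division)     # 3
print(remainder)          # 1
```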

Various languages are used in the fields of data science and artificial intelligence. In data analysis, compiled languages used to be the main choice because of the load of data processing and computation. However, Python and R now make active use of libraries (modules) that are implemented in C and other fast languages behind the scenes, which enables fast and efficient processing even in scripting languages.

The scripting languages Python and R simplify the cycle of writing and executing code (called development) by loosening type restrictions and the like, and in this sense they make work very efficient. They offer an abundance of modules for numerical and statistical processing, and they can also be applied to building websites and other applications. In addition, they are open source, are not restricted to a particular OS, are free of charge, and can be used by anyone; there are basically no restrictions on using them for commercial development. For these reasons, Python and R are very popular in the data science and AI fields and are widely used around the world.

In general, when learning a programming language, it is necessary to learn its syntax, grammar, etc. For Python as well, it is good to know the basic specifications. However, in the field of data science, it is more important to learn how to use libraries for data science, even if you do not understand the Python language specification as a whole.

Therefore, it is recommended that students study the Python language specification on their own at external sites. There are many learning sites for Python on the Internet, including videos. The following are some examples. However, this does not mean that you have to study on these sites. Students are encouraged to find the content that suits them best and study it on their own.

### 3.6.1 External Sites

[Grammar (a rough introduction)](https://qiita.com/Fendo181/items/a934e4f94021115efb2e)

[Grammar (a thorough introduction to Python)](https://docs.python.org/ja/3/tutorial/index.html)

[Practical Guide: You can choose and watch it as a series 【Python講座】第1回 開発環境構築【独り言】](https://www.youtube.com/watch?v=Wb3Ps-w-kko&list=PLzEbEpZ4njvOi0h8qf6faBR2qn3Co7SYr&index=1)

## 3.7 Launching jupyter

When Python is installed using Anaconda, a programming tool called jupyter is installed at the same time.

- On Windows, select [Anaconda]->[jupyter notebook] from the Start menu. You may be asked to select a browser, such as Firefox, Chrome, or Edge.

- On a Mac, Anaconda-Navigator is installed in the "Applications" folder, so launch it, look for the Jupyter icon on the screen, and launch Jupyter from there.

In both cases, the browser will start and the initial Jupyter screen will appear.

[The following operation is also explained in Video 1 (you can watch it by logging into the Microsoft service with your c account, e.g. c012345678@tokushima-u.ac.jp,)](https://web.microsoftstream.com/video/aed15923-3da7-4f92-9ad1-eb2aece95bbd)

<iframe width="640" height="360" src="https://web.microsoftstream.com/embed/video/aed15923-3da7-4f92-9ad1-eb2aece95bbd?autoplay=false &amp;showinfo=true" allowfullscreen style="border:none;"></iframe>

Now, select **[Python3]** from the **[New]** menu on the right, and a new Jupyter notebook will open in a new tab. The area surrounded by green on this screen is called a **cell**, where you enter Python instructions. Incidentally, in programming, instructions are called **code**.

Let's enter some data consisting of six numbers (these numbers have no particular meaning).

## 3.8 Data

In [None]:
X = [11, 18, 14, 16, 15, 19]
X

Enter `X` (an uppercase letter) on the left side, followed by an equal sign, and on the right side enter the numbers separated by commas inside square brackets. This expresses that `X` is a data set of six numbers. This operation is called **assignment**: the list of numbers is assigned to `X`.

Also, in Python, data that is a collection of numbers is called a **list**. In general, programming languages also refer to the above `X` as an **object**. In other programming languages for statistics, it is also called a vector, an array, or a series.

Now, let's manipulate this data `X`. Since you assigned yourself a set of numbers, you know how many elements it has, but let's use Python's instructions (code) to check it.

In [None]:
len(X)

A name followed by round brackets, such as `len()`, is called a **function**; `len` stands for length. The function `len()` returns the **length** of the object `X`. In programming languages, the number of elements is often expressed as "length".

In programming languages, the data is specified inside the round brackets that follow the function name. In this example, `X` is specified. The object to be specified in the round brackets is also called an **argument**. In other words, the above example was executed by specifying `X` as an argument to the function `len()`.

Now, let's find the average value of the numbers in `X`. The average value can be obtained by summing up all the numbers and dividing by the number of numbers.

To do so, we first calculate the sum of the elements of `X`. To do this, use the Python function `sum()`.

In [None]:
sum(X) 

To obtain the average, divide this total by the number of elements.

In [None]:
sum(X) / len(X)

In computers, division is represented by a slash.
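The two steps above can be wrapped into a small reusable function. This is only a sketch: the name `average` is chosen for this example and is not a built-in Python function.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if len(values) == 0:
        raise ValueError("average() needs at least one value")
    return sum(values) / len(values)


X = [11, 18, 14, 16, 15, 19]
print(sum(X))       # 93
print(len(X))       # 6
print(average(X))   # 15.5
```

The check for an empty list is there because dividing by a length of zero would otherwise stop the program with an error.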

## 3.9 Libraries for Data Science

One of the reasons why Python is widely used in data science is that it has well-developed functions for statistics.
For example, to find the average above, we wrote code that sums the elements of the data and divides by the number of elements.

If you install Python's statistical libraries, you will not have to write the formula for finding the average yourself.
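Even before turning to external libraries, the `statistics` module that ships with Python already provides such a function. A minimal sketch:

```python
import statistics

X = [11, 18, 14, 16, 15, 19]

# statistics.mean() performs the sum-and-divide step for us.
print(statistics.mean(X))   # 15.5
print(sum(X) / len(X))      # 15.5, the hand-written version
```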

Two well-known Python libraries for data analysis are **numpy** and **pandas**. Neither **numpy** nor **pandas** is a built-in feature of the Python programming language; both are extensions developed and released by the Python community.

Therefore, after installing Python, users need to install these extensions by themselves. However, in this class, it is recommended to install Python using Anaconda software, in which case both **numpy** and **pandas** are installed at the same time as Python3.

Let's actually use them.

### 3.9.1 numpy

First, using the **numpy** package, we create the object `X` again and find the average value.

[The following operations are also explained in video 2 (you can watch it by logging into the Microsoft service with the c account, e.g. c012345678@tokushima-u.ac.jp,)](https://web.microsoftstream.com/video/0cc9a827-cd67-409b-b1e7-6722d120a535)

<iframe width="640" height="360" src="https://web.microsoftstream.com/embed/video/0cc9a827-cd67-409b-b1e7-6722d120a535?autoplay=false&amp;showinfo=true" allowfullscreen style="border:none;"></iframe>

In [None]:
import numpy as np
X = np.array([11, 18, 14, 16, 15, 19])
X

To use additional extensions, Python uses the `import` statement. With a plain `import numpy`, you would need to prefix every **numpy** function with `numpy.` each time you write code. If you instead type `import numpy as np` at the beginning of your code, you only need to prefix your code with `np.` to use the **numpy** library functions from then on, which saves a little typing.

In this example, the function `np.array()` is used to create an **array** called `X`. Then, by adding `.mean()` after the array `X`, you can get the mean value of `X`.

In [None]:
X.mean()
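To make the role of the alias concrete, the following sketch (variable names are only illustrative) shows that `np` is simply a shorter name for `numpy`, and that the array method gives the same result as the hand-written formula:

```python
import numpy
import numpy as np

# "np" and "numpy" refer to the same library.
print(np is numpy)            # True

X = np.array([11, 18, 14, 16, 15, 19])

print(X.mean())               # 15.5
print(X.sum() / X.size)       # 15.5, the same calculation written out
```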

Incidentally, "mean" is the English statistical term for the average value. The mean is a value that represents the data, but it is not the only such representative value: there are also the median and the mode. The details will be explained in the fourth lecture. The everyday English word "average" can refer to any one of these three representative values (since each term begins with m, they are sometimes called the "3M").
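As a preview, the three representative values can be computed as follows. This sketch uses a made-up data set (the values are hypothetical and contain a repeated 18 so that the mode is well defined), and it takes the mode from pandas because numpy has no mode function of its own.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one repeated value
values = [11, 18, 14, 16, 15, 19, 18]

mean_value = np.mean(values)              # sum divided by the number of values
median_value = np.median(values)          # middle value after sorting
mode_value = pd.Series(values).mode()[0]  # most frequent value

print(mean_value)
print(median_value)    # 16.0
print(mode_value)      # 18
```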

### 3.9.2 pandas

Let's run the code we ran with **numpy** using the **pandas** library.

[The following operation is also explained in video 3 (you can watch it by logging into the Microsoft service with a c account, e.g. c012345678@tokushima-u.ac.jp)](https://web.microsoftstream.com/video/ae789a0d-e713-4ff1-a20c-21124b188bc4)

<iframe width="640" height="360" src="https://web.microsoftstream.com/embed/video/ae789a0d-e713-4ff1-a20c-21124b188bc4?autoplay=false&amp;showinfo=true" allowfullscreen style="border:none;"></iframe>

In [None]:
import pandas as pd
X = pd.Series([11, 18, 14, 16, 15, 19])
X

Notice that **pandas** creates `X` as an object called a Series and that, unlike **numpy**, it displays the numbers vertically.

(Note that a warning message may appear in yellow on Windows, but you can ignore it here.)

The method for finding the average is the same as for **numpy**.

In [None]:
X.mean()
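The vertical display reflects the fact that a Series pairs every value with a label, called the index. The sketch below (illustrative only) shows the index and confirms that the mean agrees with the value obtained earlier:

```python
import pandas as pd

X = pd.Series([11, 18, 14, 16, 15, 19])

print(list(X.index))   # [0, 1, 2, 3, 4, 5]
print(X[0])            # 11, the value stored under label 0
print(X.mean())        # 15.5

# The underlying values can be converted to a numpy array when needed.
print(X.to_numpy())
```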

## 3.10 DataFrame

A very important concept in data science programming is the data frame. A data frame is a data format similar to a sheet in Excel, in essence, summarizing data in a rectangle or square.

For example, a data frame could be created as follows

In [None]:
dat = pd.DataFrame({'X': [11, 18, 14, 16, 15, 19],
                    'Y': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'Z': [0.2, 0.8, 0.4, 0.6, 0.5, 0.9]})
dat

The data format is similar to an Excel sheet. This is called a **data frame**. This data frame has three columns of variables. When analyzing data in data science, we basically arrange the data in this format. The sequential numbers in the leftmost column, starting from 0, are the row numbers of the data (called the index in pandas; they are not part of the data). In Python, as in many programming languages, row and column numbers start from 0 instead of 1.

In practice, the data frame would be created by a process that reads the necessary data (file) from an Excel file or a database.
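As a rough sketch of that workflow, the example below reads the same table from CSV text with `pd.read_csv()`. To keep the example self-contained, the text is held in memory with `io.StringIO` instead of a real file; this setup is an assumption made for illustration. In practice a file path would be passed instead, and `pd.read_excel()` plays the same role for Excel files.

```python
import io
import pandas as pd

# CSV text standing in for a file on disk (hypothetical content)
csv_text = """X,Y,Z
11,A,0.2
18,A,0.8
14,B,0.4
16,B,0.6
15,C,0.5
19,C,0.9
"""

dat = pd.read_csv(io.StringIO(csv_text))

print(dat.shape)     # (6, 3)
print(dat.head())    # first rows of the data frame
```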

Let's try to find the average.

In [None]:
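dat.mean(numeric_only=True)

With `numeric_only=True`, `mean()` averages the values in the `X` and `Z` columns and skips the `Y` column. The `Y` column contains characters, so an average would be meaningless for it. (Older versions of pandas skipped such columns automatically, but recent versions raise an error if the option is omitted.)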
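How `mean()` treats non-numeric columns depends on the pandas version: older releases skip them silently, while recent releases raise an error unless told otherwise. The following sketch shows two ways of being explicit; it rebuilds `dat` so that it can be run on its own.

```python
import pandas as pd

dat = pd.DataFrame({'X': [11, 18, 14, 16, 15, 19],
                    'Y': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'Z': [0.2, 0.8, 0.4, 0.6, 0.5, 0.9]})

# Option 1: ask mean() to use numeric columns only.
means = dat.mean(numeric_only=True)

# Option 2: select the numeric columns first, then average them.
means_selected = dat.select_dtypes(include='number').mean()

print(means)
print(means_selected)
```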

However, there are cases where you want to find the average value for each group. For example, you may want to find the average height of men and of women separately.

Let us group the `dat` data by the level of the `Y` column (in this case, "A", "B", "C") and find the mean value.

In [None]:
dat.groupby('Y').mean()

The function `groupby()` allows you to group the data by level in the `Y` column and to find the mean of each separately.
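The grouped means can be checked by hand: group A contains the `X` values 11 and 18, so its mean is 14.5. The sketch below rebuilds `dat` to be self-contained and also shows `agg()`, which computes several summaries per group at once; the choice of summaries is only an example.

```python
import pandas as pd

dat = pd.DataFrame({'X': [11, 18, 14, 16, 15, 19],
                    'Y': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'Z': [0.2, 0.8, 0.4, 0.6, 0.5, 0.9]})

# Mean of every numeric column within each level of Y
group_means = dat.groupby('Y').mean()
print(group_means)

# Several summaries of column X for each group
summary = dat.groupby('Y')['X'].agg(['mean', 'min', 'max'])
print(summary)
```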

## 3.11 Creating Graphs

Next, here is an example of data manipulation and graphing. The package **seaborn** provides the data set `titanic`. It contains data on the passengers of a luxury liner that struck an iceberg and sank in the early morning of April 15, 1912, while sailing in the Atlantic Ocean, including sex, age, cabin grade, and so on, as well as whether each passenger survived.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
titanic = sns.load_dataset('titanic')

The first three lines declare the use of tools that are standard for drawing graphs in Python data analysis; the fourth line loads the Titanic data provided through the visualization library **seaborn**. Note that `load_dataset()` downloads the data, so an Internet connection is needed the first time it is run.

Let's look at an overview of the data.

In [None]:
titanic.info()

Listed here are the column names, the type of their contents (for example, numeric or categorical), and the number of non-missing entries in each column. The counts differ between columns because some columns (variables) contain missing values (NA).

Check the number of rows and columns in the `titanic` data.

In [None]:
titanic.shape

Note that `shape` is not a function but an attribute (also called a property), so it is written without round brackets.
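The difference between an attribute and a function can be seen on any data frame. The sketch below uses a tiny made-up table (not the real Titanic data) so that it runs without downloading anything:

```python
import pandas as pd

# Hypothetical miniature passenger table
df = pd.DataFrame({'survived': [0, 1, 1, 0],
                   'age': [22.0, 38.0, None, 35.0]})

print(df.shape)            # attribute, no parentheses: (4, 2)
n_rows, n_cols = df.shape
print(n_rows, n_cols)      # 4 2

print(len(df))             # function call, with parentheses: 4
print(df['age'].count())   # method call: number of non-missing ages, 3
```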

The Titanic actually carried more than two thousand people (passengers and crew), but this data set covers a subset of about 900 passengers (891 rows).
Now let's check the number of survivors of these passengers.

In [None]:
titanic['survived'].value_counts()

To select a column in the data frame, place square brackets after the object name (`titanic`) and give the column name in single quotation marks. The function `value_counts()`, attached after the closing bracket with a dot, displays the number of rows for each level of the specified categorical variable. Here, 0 corresponds to death and 1 to survival.
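To see what `value_counts()` returns, consider the following sketch with a short made-up column (hypothetical values, not the real data). The `normalize=True` option, shown for comparison, returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical survival column: 0 = died, 1 = survived
survived = pd.Series([0, 1, 1, 0, 0, 0, 1, 0], name='survived')

counts = survived.value_counts()
print(counts.loc[0])   # 5 rows with value 0
print(counts.loc[1])   # 3 rows with value 1

proportions = survived.value_counts(normalize=True)
print(proportions.loc[0])   # 0.625
```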

Let's create a histogram of age.

In [None]:
titanic['age'].hist()

The histogram, which will be explained in Part 5 of the lecture, is, in essence, a bar graph showing the number of passengers in each age group.
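The counting behind a histogram can be reproduced numerically with `np.histogram()`. The ages below are made up for illustration, and the 20-year bin width is an arbitrary choice:

```python
import numpy as np

# Hypothetical ages
ages = np.array([2, 9, 17, 22, 24, 29, 31, 38, 45, 61])

# Bin edges: [0, 20), [20, 40), [40, 60), [60, 80]
counts, edges = np.histogram(ages, bins=[0, 20, 40, 60, 80])

# Each bar of the histogram corresponds to one of these counts.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count} passengers")
```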

Let us now check the number of passengers by sex.

In [None]:
sns.countplot(x='sex', data=titanic, palette='rainbow')

Let's check the number of people in each cabin grade.

In [None]:
sns.countplot(x='pclass', data=titanic, palette='rainbow')

Let's look at the relationship between cabin grade and age.

In [None]:
sns.catplot(x='pclass', y='age', hue='sex', data=titanic, kind='point', palette='pastel')
plt.show()

Let's check the relationship between the number of survivors and sex in a bar chart.

In [None]:
sns.countplot(x='survived', data=titanic, hue='sex', palette='rainbow')

Now, let's check the cabin grade, sex, and survival percentage.

In [None]:
titanic.pivot_table(index="pclass", columns= "sex", values="survived")

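`pivot_table()` is an instruction for building a cross table. With its default settings it averages the values in each cell, so the table above shows the proportion of survivors for each combination of cabin grade and sex. When the cells instead hold the number of persons (number of items), the table is called a **contingency table**: each row and each column corresponds to a level of a variable, and the counts for each combination are summarized, as in the following example.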

<center>

|                       | Agree | Disagree |
|-----------------------|-------|----------|
| Age Group Under 20    | 10    | 1        |
| Age Group Up to 30    | 15    | 3        |
| Age Group Up to 40    | 25    | 35       |
| Age Group 50 and over | 30    | 12       |
</center>
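The difference between a table of counts and a table of rates can be sketched with a small made-up data set (hypothetical rows, not the real Titanic data). `pd.crosstab()` counts the rows in each cell, whereas `pivot_table()` with its default settings averages the `survived` column, which gives the proportion of survivors:

```python
import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    'pclass':   [1, 1, 1, 3, 3, 3, 3, 3],
    'sex':      ['female', 'female', 'male', 'female',
                 'male', 'male', 'male', 'female'],
    'survived': [1, 1, 0, 1, 0, 0, 1, 0],
})

# Contingency table: number of passengers in each class/sex cell
counts = pd.crosstab(df['pclass'], df['sex'])
print(counts)

# Survival rate in each cell (mean of the 0/1 column)
rates = df.pivot_table(index='pclass', columns='sex', values='survived')
print(rates)
```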

The Titanic data are well known for showing that survival varied considerably by sex, age, and cabin grade. Simply put, women and young passengers, i.e., children, were more likely to survive. Conversely, older male passengers were more likely to die. The data suggest that adult males on board the Titanic gave priority to saving women and children, at the cost of their own lives.

The Titanic data will be analyzed in more detail during the statistical modeling week of this lecture.

Finally, let's save the data we entered into Jupyter. Before we do that, let's check the default folder.

In [None]:
import os
os.getcwd()

The Jupyter notebook will be saved in the folder shown. Change "Untitled", shown next to the logo at the upper left of the Jupyter notebook, to a name such as DataScience, and then click the Save button. The next time you start Jupyter, you should see the notebook you created this time as DataScience.ipynb in the list of files and folders on the first screen. Click on it, and you can continue from where you left off.

Press logout in the upper right corner of the Jupyter screen.

Also, on another window (Terminal or Command Prompt), hold down Control and press the letter C twice in succession to completely exit Jupyter.

For more information on how to use Jupyter Notebook, please refer to the following sites, for example.

[データ分析で欠かせない！Jupyter Notebookの使い方【初心者向け】](https://techacademy.jp/magazine/17430)

## 3.12 About the Quiz

After reading this content, you must answer the **"Week 3 Quiz (must be answered by all students)"**. By answering the quiz, you will be considered to have attended the Week 3 lecture.

The quiz will include a number of questions on the basics of Python. Please study the sites recommended above. In particular, for "types," "arrays," "conditional branching," and "importing and using libraries," reading the text on the reference sites is not sufficient. You should actually start Jupyter (or access Google Colab), type in the code yourself, and execute it. Then, take the quiz with Jupyter running.