# <div class="alert alert-block alert-success">
# In this tutorial, we’ll walk through using <FONT COLOR="RED">NumPy</FONT> to analyze <FONT COLOR="RED">data on wine quality</FONT>. The data contains information on various attributes of wines, such as <FONT COLOR="RED">pH</FONT> and <FONT COLOR="RED">fixed acidity</FONT>, along with a <FONT COLOR="RED">quality score</FONT> between 0 and 10 for each wine. The quality score is the median of at least 3 evaluations by human taste testers. As we learn how to work with NumPy, we’ll try to figure out more about the perceived quality of wine.
#  
# The dataset used in this lecture is available at https://archive.ics.uci.edu/ml/datasets/Wine+Quality
# </div>

# Retrieving and Averaging CSV Data Represented as Lists of Lists

Using Python and the csv package:
* Import the <FONT COLOR="RED">csv</FONT> library
* Open the <FONT COLOR="RED">winequality-red.csv</FONT> file
* * With the file open, create a new <FONT COLOR="RED">csv.reader</FONT> object
* * * Pass in the keyword argument delimiter=";" to make sure that the records are split up on the semicolon character instead of the default comma character
* * Call the <FONT COLOR="RED">list</FONT> type to get all the rows from the file
* * Assign the result to <FONT COLOR="RED">wines</FONT>


In [None]:
import csv
with open('../input/lecture3datasetwinequalityred/winequality-red.csv', 'r') as f:
    wines = list(csv.reader(f, delimiter=';'))

Once we’ve read in the data, we can print out <FONT COLOR="RED">the first 3 rows</FONT>:

In [None]:
print(wines[:3])

The data has been read into a list of lists. Each inner list is a row from the semicolon-separated file. As you may have noticed, each item in the entire list of lists is represented as a string, which will make it harder to do computations.
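
To see why string values get in the way, here is a minimal sketch. The rows are made-up placeholders shaped like the output of csv.reader, not values taken from the dataset, and the names `rows`, `as_text`, and `as_numbers` are only for illustration.

```python
# Illustrative rows shaped like the output of csv.reader.
# The values are placeholders, not taken from the wine dataset.
rows = [
    ["fixed acidity", "pH", "quality"],
    ["7.0", "3.5", "5"],
    ["8.0", "3.25", "6"],
]

# Adding two string fields concatenates the text instead of adding numbers.
as_text = rows[1][0] + rows[2][0]

# Converting each field to float first gives the arithmetic we want.
as_numbers = float(rows[1][0]) + float(rows[2][0])

print(as_text)     # 7.08.0
print(as_numbers)  # 15.0
```

This is why the values have to be converted with float before any computation.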

As the output above shows, we’ve printed three rows, the first of which contains the column headers. Each row after the header row represents a wine. The first element of each row is the fixed acidity, the second is the volatile acidity, and so on. With the data in this form, we can find the average quality of the wines. The code below will:

* Extract the last element from each row after the header row.
* Convert each extracted element to a float.
* Assign all the extracted elements to the list qualities.
* Divide the sum of all the elements in qualities by the total number of elements in qualities to get the mean.

In [None]:
qualities = [float(item[-1]) for item in wines[1:]]
sum(qualities) / len(qualities)

# Converting Lists of Lists to NumPy 2-Dimensional Arrays

A <FONT COLOR="RED">**2-dimensional array**</FONT> is also known as a <FONT COLOR="RED">**matrix**</FONT>, and is something you should be familiar with. In fact, it’s just a different way of thinking about a list of lists. A matrix has rows and columns. By specifying a row number and a column number, we’re able to extract an element from a matrix.

In a NumPy array, the number of dimensions is called the <FONT COLOR="RED">**rank**</FONT>, and each dimension is called an <FONT COLOR="RED">**axis**</FONT>. So the rows are the first axis, and the columns are the second axis.
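
As a rough illustration of these terms, the sketch below builds a small matrix with arbitrary values and inspects it. The variable name `matrix` is an assumption made for this example. The `ndim` attribute reports the number of dimensions, and `shape` gives the length of each axis.

```python
import numpy as np

# A small matrix with 2 rows and 3 columns (arbitrary values).
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

print(matrix.ndim)      # number of dimensions (axes): 2
print(matrix.shape)     # length of each axis: (2, 3)
print(matrix.shape[0])  # length of the first axis (rows): 2
print(matrix.shape[1])  # length of the second axis (columns): 3
```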

* Import the numpy package.
* Pass the list of lists wines into the array function, which converts it into a NumPy array.
* * Exclude the header row with list slicing.
* * Specify the keyword argument dtype to make sure each element is converted to a float. We’ll dive more into what the dtype is later on.

In [None]:
import csv
with open("../input/lecture3datasetwinequalityred/winequality-red.csv", 'r') as f:
    wines = list(csv.reader(f, delimiter=";"))
import numpy as np
# np.float was removed in NumPy 1.24; the built-in float gives the same float64 dtype
wines = np.array(wines[1:], dtype=float)

In [None]:
wines

We can check the number of rows and columns in our data using the <FONT COLOR="RED">**shape**</FONT> property of NumPy arrays:

In [None]:
wines.shape

# Alternative: Using NumPy to Read In Files

It’s possible to use NumPy to directly read csv or other files into arrays. We can do this using the <FONT COLOR="RED">numpy.genfromtxt</FONT> function. We can use it to read in our initial data on red wines.

* Use the <FONT COLOR="RED">genfromtxt</FONT> function to read in the <FONT COLOR="RED">winequality-red.csv</FONT> file.
* Specify the keyword argument <FONT COLOR="RED">delimiter=";"</FONT> so that the fields are parsed properly.
* Specify the keyword argument <FONT COLOR="RED">skip_header=1</FONT> so that the header row is skipped.
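
The same steps can be tried without the real file. The sketch below feeds genfromtxt an in-memory text buffer. The sample text is made up for illustration and only mimics the layout of the real file, and the names `sample` and `data` are hypothetical.

```python
import io
import numpy as np

# Made-up, semicolon-delimited text standing in for the real file.
sample = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    "7.0;3.5;5\n"
    "8.0;3.25;6\n"
)

# skip_header=1 drops the header line; the default dtype is float.
data = np.genfromtxt(sample, delimiter=";", skip_header=1)

print(data.shape)  # (2, 3)
print(data.dtype)  # float64
```

Without skip_header=1, the header fields could not be parsed as numbers and would show up as nan values in the first row.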

In [None]:
wines = np.genfromtxt("../input/lecture3datasetwinequalityred/winequality-red.csv", delimiter=";", skip_header=1)

In [None]:
wines

# Indexing NumPy Arrays

We can use array indexing to select individual elements, groups of elements, or entire rows and columns. One important thing to keep in mind is that just like Python lists, <FONT COLOR="RED">NumPy is zero-indexed</FONT>, meaning that the index of the first row is 0, and the index of the first column is 0. If we want to work with the fourth row, we’d use index 3; if we want to work with the second row, we’d use index 1, and so on. We’ll again work with the wines array:

In [None]:
wines[0,0]

Since we’re working with a 2-dimensional array in NumPy, we specify 2 indexes to retrieve an element. **The first index is the row**, or axis 0, index, and **the second index is the column**, or axis 1, index. Any element in wines can be retrieved using 2 indexes.
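
For comparison with plain Python, here is a small sketch on a made-up 3x3 grid. The names `nested` and `grid` are assumptions for this example. A list of lists is indexed one level at a time, while a NumPy array takes the row and column indexes together.

```python
import numpy as np

nested = [[10, 11, 12],
          [20, 21, 22],
          [30, 31, 32]]
grid = np.array(nested)

# List of lists: index the row first, then the column within that row.
from_list = nested[2][1]

# NumPy array: give both indexes at once, as row then column.
from_array = grid[2, 1]

print(from_list, from_array)  # 31 31
```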

In [None]:
wines[2,3]

**Slicing** also works.
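
A quick sketch of the slicing forms on a small made-up array may help before applying them to wines. The array `grid` and the other names are assumptions for this example. A colon with no bounds means "everything along this axis".

```python
import numpy as np

# 3 rows and 4 columns holding the values 0 to 11.
grid = np.arange(12).reshape(3, 4)

first_two_in_col = grid[:2, 3]  # rows 0 and 1 of column 3
whole_column = grid[:, 3]       # every row of column 3
whole_row = grid[1, :]          # every column of row 1

print(first_two_in_col.tolist())  # [3, 7]
print(whole_column.tolist())      # [3, 7, 11]
print(whole_row.tolist())         # [4, 5, 6, 7]
```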

In [None]:
wines[:3,3]

In [None]:
np.array([wines[0,3], wines[1,3], wines[2,3]])

We can select an **entire column** by specifying that we want all the elements, from the first to the last. We specify this by just using the colon (:), with no starting or ending indices. The below code will select the entire fourth column:

In [None]:
wines[:,3]

We selected an entire column above, but we can also extract an **entire row**:

In [None]:
wines[3,:]

If we take our indexing to the extreme, we can select the entire array using two colons to select all the rows and columns in wines. This is a great party trick, but doesn’t have a lot of good applications:

In [None]:
wines[:,:]

# Assigning Values To NumPy Arrays

We can also use indexing to assign values to certain elements in arrays. We can do this by assigning directly to the indexed value:

In [None]:
wines[1,5] = 10

We can do the same for slices. To overwrite an entire column, we can do this:

In [None]:
wines[:,10] = 50

The above code overwrites all the values in the eleventh column with 50.
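
One thing worth knowing when assigning through slices: basic slicing returns a view onto the original array, not a copy. The sketch below uses a made-up array of zeros, and the names `table`, `column_view`, and `column_copy` are hypothetical. It shows that a slice taken earlier sees later assignments, while an explicit copy does not.

```python
import numpy as np

table = np.zeros((3, 4))

column_view = table[:, 2]         # basic slicing returns a view
column_copy = table[:, 2].copy()  # an independent copy of the same column

table[:, 2] = 50                  # overwrite the third column in place

print(column_view.tolist())  # [50.0, 50.0, 50.0]
print(column_copy.tolist())  # [0.0, 0.0, 0.0]
```

If you want to keep the original values around before overwriting a column, taking a copy first is one way to do it.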

In [None]:
wines[:,:]

# 1-Dimensional NumPy Arrays

NumPy is a package for working with multidimensional arrays. One of the most common types of multidimensional arrays is the 1-dimensional array, or vector. As you may have noticed above, when we sliced wines, we retrieved a 1-dimensional array. A 1-dimensional array only needs a single index to retrieve an element. Each row and column in a 2-dimensional array is a 1-dimensional array. Just like a list of lists is analogous to a 2-dimensional array, a single list is analogous to a 1-dimensional array. If we slice wines and only retrieve the third row, we get a 1-dimensional array:

In [37]:
third_wine = wines[2,:]  # index 2 is the third row, since indexing starts at 0

print(third_wine)

# N-Dimensional NumPy Arrays

There are cases when you’ll want to deal with arrays that have three or more dimensions. One way to think of a 3-dimensional array is as a list of lists of lists. Let’s say we want to store the monthly earnings of a store, but we want to be able to quickly look up the results for a quarter and for a year. The earnings for one year might look like this:

In [38]:
year_one = [
    [500,505,490],
    [810,450,678],
    [234,897,430],
    [560,1023,640]
]

We can retrieve the earnings from January by calling year_one[0][0]. If we want the results for a whole quarter, we can call year_one[0] or year_one[1]. We now have a 2-dimensional array, or matrix. But what if we now want to add the results from another year? We have to add a third dimension:

In [40]:
earnings = [
    [
        [500,505,490],
        [810,450,678],
        [234,897,430],
        [560,1023,640]
    ],
    [
        [600,605,490],
        [345,900,1000],
        [780,730,710],
        [670,540,324]
    ]
]

In [41]:
year_one[0][0]

In [42]:
earnings = np.array(earnings)
earnings[0,0,0]

We can also find the shape of the array:

In [43]:
earnings.shape

If we wanted to get first quarter earnings from both years, we could do this:

In [44]:
earnings[:,0,:]
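
The sketch below rebuilds the same earnings data in a self-contained form and tries a few more index combinations. It assumes the axis order is year, quarter, month, as in the lists above. The names `first_quarters`, `january_both_years`, and `year_two_q3` are only for illustration.

```python
import numpy as np

earnings = np.array([
    [[500, 505, 490], [810, 450, 678], [234, 897, 430], [560, 1023, 640]],
    [[600, 605, 490], [345, 900, 1000], [780, 730, 710], [670, 540, 324]],
])

print(earnings.shape)  # (2, 4, 3): years, quarters, months

# First quarter of every year: one row of three months per year.
first_quarters = earnings[:, 0, :]
print(first_quarters.shape)  # (2, 3)

# January of every year: first quarter, first month.
january_both_years = earnings[:, 0, 0]
print(january_both_years.tolist())  # [500, 600]

# Third quarter of the second year.
year_two_q3 = earnings[1, 2, :]
print(year_two_q3.tolist())  # [780, 730, 710]
```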

# NumPy Data Types

As we mentioned earlier, each NumPy array can store elements of a single data type. For example, <FONT COLOR="red">wines</FONT> contains only float values. **NumPy stores values using its own data types, which are distinct from Python types like <FONT COLOR="red">float</FONT> and <FONT COLOR="red">str</FONT>.** This is because the core of NumPy is written in a programming language called C, which stores data differently than the Python data types. NumPy data types map between Python and C, allowing us to use NumPy arrays without any conversion hitches.

You can find the data type of a NumPy array by accessing the dtype property:

In [45]:
wines.dtype

NumPy has several different data types, which mostly map to Python data types, like <FONT COLOR="red">float</FONT> and <FONT COLOR="red">str</FONT>. The NumPy documentation has a full listing of data types, but here are a few important ones:

* <FONT COLOR="red">float</FONT> — numeric floating point data.
* <FONT COLOR="red">int</FONT> — integer data.
* <FONT COLOR="red">string</FONT> — character data.
* <FONT COLOR="red">object</FONT> — Python objects.

Data types additionally end with a suffix that indicates how many bits of memory they take up. So <FONT COLOR="red">int32</FONT> is a 32 bit integer data type, and <FONT COLOR="red">float64</FONT> is a 64 bit float data type.
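
As a small illustration of the suffixes and of type conversion, the sketch below uses a made-up array. The names `values` and `as_int32` are assumptions for this example. The `itemsize` attribute reports bytes per element, and converting floats with `astype` drops the fractional part rather than rounding.

```python
import numpy as np

values = np.array([1.9, -1.9, 3.0])
print(values.dtype)     # float64
print(values.itemsize)  # 8 bytes per element, which is 64 bits

# Converting to a 32-bit integer type truncates toward zero.
as_int32 = values.astype(np.int32)
print(as_int32.dtype)     # int32
print(as_int32.itemsize)  # 4 bytes per element, which is 32 bits
print(as_int32.tolist())  # [1, -1, 3]
```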

In [48]:
wines_int = wines.astype(int)

wines_int

# NumPy Array Operations

# # Single Array Math

In [52]:
wines[:,11] + 10

In [51]:
wines[:,11] += 10

wines

In [56]:
wines[:,11] *= 2

wines

# # Multiple Array Math

In [57]:
wines[:,11] + wines[:,11]
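
The cells above rely on NumPy applying arithmetic element by element. Here is a minimal sketch on made-up arrays. The names `scores` and `weights` are hypothetical and unrelated to the wine data. A scalar is applied to every element, and two arrays of the same shape are combined position by position.

```python
import numpy as np

scores = np.array([5.0, 6.0, 7.0])
weights = np.array([1.0, 0.5, 2.0])

shifted = scores + 10        # the scalar is added to every element
combined = scores * weights  # element-by-element product

print(shifted.tolist())   # [15.0, 16.0, 17.0]
print(combined.tolist())  # [5.0, 3.0, 14.0]

# The += form modifies the existing array instead of creating a new one.
scores += 10
print(scores.tolist())    # [15.0, 16.0, 17.0]
```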

# NumPy Array Methods

In [58]:
wines[:,11].sum()

If we call sum on the wines matrix and pass in **axis=0**, we’ll find the sums over the first axis of the array. This gives us the sum of all the values in every column. It may seem backwards that summing over the first axis gives the sum of each column, but one way to think about it is that the specified axis is the one “going away”. With axis=0, the rows are collapsed: the values in each column are added together down the rows, leaving one sum per column:

In [59]:
wines.sum(axis=0)

We can verify that we did the sum correctly by checking the shape. The shape should be (12,), corresponding to the number of columns:

In [60]:
wines.sum(axis=0).shape

If we pass in **axis=1**, we’ll find the sums over the second axis of the array. This will give us the sum of each row:

In [61]:
wines.sum(axis=1)
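
The axis behavior is easier to check by hand on a tiny array. The sketch below uses a made-up 2x3 grid. The names `grid`, `column_sums`, `row_sums`, and `total` are assumptions for this example.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# axis=0 collapses the rows, leaving one sum per column.
column_sums = grid.sum(axis=0)
print(column_sums.tolist(), column_sums.shape)  # [5, 7, 9] (3,)

# axis=1 collapses the columns, leaving one sum per row.
row_sums = grid.sum(axis=1)
print(row_sums.tolist(), row_sums.shape)  # [6, 15] (2,)

# With no axis argument, every element is summed.
total = grid.sum()
print(total)  # 21
```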

There are several other methods that behave like the sum method, including:

* numpy.ndarray.**mean** — finds the mean of an array.
* numpy.ndarray.**std** — finds the standard deviation of an array.
* numpy.ndarray.**min** — finds the minimum value in an array.
* numpy.ndarray.**max** — finds the maximum value in an array.
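
These methods accept the same axis argument as sum. Here is a brief sketch on a made-up 2x2 array. The name `scores` and the values are hypothetical. Note that std computes the population standard deviation by default.

```python
import numpy as np

scores = np.array([[5.0, 6.0],
                   [7.0, 8.0]])

overall_mean = scores.mean()       # mean of all four values
lowest = scores.min()              # smallest value in the array
column_max = scores.max(axis=0)    # largest value in each column
row_means = scores.mean(axis=1)    # mean of each row
spread = scores.std()              # population standard deviation

print(overall_mean)         # 6.5
print(lowest)               # 5.0
print(column_max.tolist())  # [7.0, 8.0]
print(row_means.tolist())   # [5.5, 7.5]
print(round(spread, 3))     # about 1.118
```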

# Plot from Wines

In [83]:
wines = np.genfromtxt("../input/lecture3datasetwinequalityred/winequality-red.csv", delimiter=";", skip_header=1)

In [84]:
wines.shape

In [91]:
from matplotlib import pyplot as plt 
wineIds = np.arange(start=0, stop=1599, step=1)
qualities = wines[:, 11] 

f = plt.figure()
f.set_figwidth(25)
f.set_figheight(5)

plt.bar(wineIds, qualities, color = 'g', align = 'center') 

plt.title('Bar graph to represent Wines quality') 
plt.ylabel('Quality') 
plt.xlabel('Wine')  

plt.show()

In [97]:
from matplotlib import pyplot as plt 
import seaborn as sns

wineIds = np.arange(start=0, stop=1599, step=1)
qualities = wines[:, 11] 

f = plt.figure()
f.set_figwidth(25)
f.set_figheight(5)


# matplotlib histogram
plt.hist(qualities, color = 'blue', edgecolor = 'black',
         bins = 10)

# Add labels
plt.title('Histogram of Wine Quality')
plt.xlabel('Quality')
plt.ylabel('Wines')

# Display
plt.show()

In [100]:
# Import Library

import numpy as np 
import matplotlib.pyplot as plt

# Data Coordinates

wineIds = np.arange(start=0, stop=1599, step=1)
fixedAcidity = wines[:, 0] 
ph = wines[:, 8] 


f = plt.figure()
f.set_figwidth(25)
f.set_figheight(10)


# Plot

plt.plot(wineIds, fixedAcidity) 
plt.plot(wineIds, ph)

# Add Title

plt.title("Fixed acidity and pH data from the 1599 wines") 

# Add Axes Labels

plt.xlabel("Wines") 
plt.ylabel("Values") 

# Display

plt.show()

In [104]:
# Import Library

import numpy as np 
import matplotlib.pyplot as plt

# Data Coordinates

wineIds = np.arange(start=0, stop=1599, step=1)
fixedAcidity = wines[:, 0] 
ph = wines[:, 8] 
qualities = wines[:, 11] 


# Create figure and subplot

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
fig.set_figwidth(25)
fig.set_figheight(25)



# Plot

ax.scatter(fixedAcidity, ph, qualities, color='red')
ax.set_xlabel('fixedAcidity')
ax.set_ylabel('ph')
ax.set_zlabel('qualities')

# Display

plt.show()