# NumPy

## The Basics of NumPy Arrays
Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array. This section will present several examples of using NumPy array manipulation to access data.

We'll cover a few categories of basic array manipulations here:

- Attributes of arrays: Determining the size and shape of an array
- Indexing of arrays: Getting and setting the value of individual array elements
- Slicing of arrays: Getting and setting smaller subarrays within a larger array

In [1]:
import numpy as np

In [2]:
a = np.array([9, 10, 17, 3])  # One-dimensional array

b = np.array([[3, 5, 2, 4], 
              [7, 6, 8, 8],
              [1, 6, 7, 7]])  # Two-dimensional array

c = np.array([[8, 1, 5, 9], 
              [8, 9, 4, 3],
              [0, 3, 5, 0]])  # Two-dimensional array

In [3]:
print("Dimension of array a is:", a.ndim)
print("Shape of array a is:", a.shape)
print("Size of array a is:", a.size)

Dimension of array a is: 1
Shape of array a is: (4,)
Size of array a is: 4


In [4]:
print("Dimension of array b is:", b.ndim)
print("Shape of array b is:", b.shape)
print("Size of array b is 3 * 4 =", b.size)

Dimension of array b is: 2
Shape of array b is: (3, 4)
Size of array b is 3 * 4 = 12
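
As a small check on how these attributes relate, the sketch below recomputes `size` from `shape` and also prints `dtype`, an attribute the cells above do not show. The array `demo` is a made-up example, not one of the arrays used in this notebook.

```python
import numpy as np

# Hypothetical array, for illustration only
demo = np.array([[1, 2, 3],
                 [4, 5, 6]])

print("ndim :", demo.ndim)    # number of axes
print("shape:", demo.shape)   # length of each axis
print("size :", demo.size)    # total number of elements
print("dtype:", demo.dtype)   # element type (integer width depends on the platform)

# size is the product of the entries of shape
assert demo.size == np.prod(demo.shape)
```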


### Operations on NumPy Arrays

In [5]:
# Multiply every element of c by a constant
scaled = c * 3
print(scaled)

[[24  3 15 27]
 [24 27 12  9]
 [ 0  9 15  0]]


In [6]:
# Add b and c element-wise; this works because they have the same shape
add = b + c
print(add)

[[11  6  7 13]
 [15 15 12 11]
 [ 1  9 12  7]]
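
The addition above works because `b` and `c` have matching shapes. As a rough sketch of the other cases, using hypothetical arrays `x`, `y` and `row` that are not part of this notebook: NumPy raises a `ValueError` when the shapes are incompatible, while a one-dimensional array whose length matches the last axis is broadcast across every row.

```python
import numpy as np

x = np.ones((3, 4), dtype=int)      # hypothetical 3 x 4 array
y = np.ones((2, 4), dtype=int)      # hypothetical 2 x 4 array: incompatible with x
row = np.array([10, 20, 30, 40])    # shape (4,): compatible with x via broadcasting

try:
    x + y
except ValueError as err:
    print("Cannot add:", err)

result = x + row    # row is added to each of the three rows of x
print(result)
```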


In [7]:
# Element-wise multiplication
multiply = b * c
print(multiply)

[[24  5 10 36]
 [56 54 32 24]
 [ 0 18 35  0]]


In [8]:
# Square every element of b
square = b**2
print(square)

[[ 9 25  4 16]
 [49 36 64 64]
 [ 1 36 49 49]]


In [9]:
# Element-wise comparison: is each element of b greater than the matching element of c?
logical = b > c
print(logical)

[[False  True False False]
 [False False  True  True]
 [ True  True  True  True]]


In [14]:
b

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [10]:
np.min(b)           # Minimum of the flattened array

1

In [11]:
np.min(b, axis=0)   # Minimum along the first axis: one value per column

array([1, 5, 2, 4])

In [12]:
np.min(b, axis=1)   # Minimum along the second axis: one value per row

array([2, 6, 1])

In [13]:
np.max(b)         # Maximum of the flattened array

8
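
One way to remember the `axis` argument is that the named axis is the one that gets collapsed. The sketch below rebuilds `b` so that it runs on its own, and uses `np.max`, which accepts the same `axis` argument as `np.min`. The names `col_max` and `row_max` are introduced here for illustration.

```python
import numpy as np

b = np.array([[3, 5, 2, 4],
              [7, 6, 8, 8],
              [1, 6, 7, 7]])

col_max = np.max(b, axis=0)   # collapse the 3 rows: one value per column
row_max = np.max(b, axis=1)   # collapse the 4 columns: one value per row

print(col_max.shape, col_max)   # (4,) [7 6 8 8]
print(row_max.shape, row_max)   # (3,) [5 8 7]
```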

### Array Indexing: Accessing Single Elements 

If you are familiar with Python's standard list indexing, indexing in NumPy will feel quite familiar. In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets, just as with Python lists:

In [19]:
a

array([ 9, 10, 17,  3])

In [21]:
a[0]

9

In [22]:
a[-1]

3

In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:

In [23]:
b

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [24]:
b[0, 0]

3

In [25]:
b[2, -1]

7

In [26]:
b[0, 0] = 12   # the same index notation also sets a value
b

array([[12,  5,  2,  4],
       [ 7,  6,  8,  8],
       [ 1,  6,  7,  7]])
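
The overview at the start of this section also lists slicing, which the cells above do not demonstrate. Here is a minimal sketch, assuming the usual `start:stop:step` syntax shared with Python lists, and using fresh copies of `a` and `b` so that it runs on its own. The name `first_row` is introduced here for illustration.

```python
import numpy as np

a = np.array([9, 10, 17, 3])
b = np.array([[3, 5, 2, 4],
              [7, 6, 8, 8],
              [1, 6, 7, 7]])

print(a[1:3])      # elements at index 1 and 2: [10 17]
print(a[::-1])     # all elements in reverse order
print(b[:2, 1:3])  # first two rows, columns 1 and 2
print(b[:, 0])     # first column: [3 7 1]

# A slice is a view, so writing through it changes the original array
first_row = b[0, :]
first_row[0] = 12
print(b[0, 0])     # 12
```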

# Pandas

Pandas will be a major tool of interest throughout the course. It
contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and easy in Python. Pandas is often used in tandem with numerical
computing tools like NumPy and SciPy, analytical libraries like statsmodels and
scikit-learn, and data visualization libraries like matplotlib.

Throughout the rest of the notebook, I use the following import convention for pandas:
- *import pandas as pd*

Thus, whenever you see pd. in code, it refers to pandas.

In [27]:
import pandas as pd  #for data analysis and manipulation

### Importing Data into a Pandas DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row index and a column index.
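
To make the idea of differently typed columns concrete, here is a small sketch that builds a DataFrame by hand. The name `toy` and its column names and values are invented for illustration and are not read from `car_data.csv`.

```python
import pandas as pd

# Hypothetical toy data: one column per value type
toy = pd.DataFrame({
    "name": ["alpha", "beta", "gamma"],   # strings
    "year": [2014, 2013, 2017],           # integers
    "price": [3.35, 4.75, 7.25],          # floats
    "sold": [True, False, True],          # booleans
})

print(toy.dtypes)    # one dtype per column
print(toy.index)     # row index
print(toy.columns)   # column index
```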

In [28]:
data = pd.read_csv('car_data.csv')

In [29]:
data

     Car_Name  Year  Selling_Price  Kms_Driven  Fuel_Type  Seller_Type  Transmission  Owner
0        ritz  2014           3.35       27000     Petrol       Dealer        Manual      0
1         sx4  2013           4.75       43000     Diesel       Dealer        Manual      0
2        ciaz  2017           7.25        6900     Petrol       Dealer        Manual      0
3     wagon r  2011           2.85        5200     Petrol       Dealer        Manual      0
4       swift  2014           4.60       42450     Diesel       Dealer        Manual      0
...       ...   ...            ...         ...        ...          ...           ...    ...
296      city  2016           9.50       33988     Diesel       Dealer        Manual      0
297      brio  2015           4.00       60000     Petrol       Dealer        Manual      0
298      city  2009           3.35       87934     Petrol       Dealer        Manual      0
299      city  2017          11.50        9000     Diesel       Dealer        Manual      0


### Some important methods to manipulate and analyze data

In [30]:
data.head()  # head() returns the first five rows by default

   Car_Name  Year  Selling_Price  Kms_Driven  Fuel_Type  Seller_Type  Transmission  Owner
0      ritz  2014           3.35       27000     Petrol       Dealer        Manual      0
1       sx4  2013           4.75       43000     Diesel       Dealer        Manual      0
2      ciaz  2017           7.25        6900     Petrol       Dealer        Manual      0
3   wagon r  2011           2.85        5200     Petrol       Dealer        Manual      0
4     swift  2014           4.60       42450     Diesel       Dealer        Manual      0


We see that our data is a mix of numeric and categorical attributes. The attribute Selling_Price is in units of $100,000. Let us look at a summary of the data:

In [31]:
data.describe()  # describe() gives an eight-number summary (count, mean, std, min, quartiles, max) of the numeric columns

              Year  Selling_Price    Kms_Driven     Owner
count        301.0          301.0         301.0     301.0
mean   2013.627907       4.661296   36947.20598  0.043189
std       2.891554       5.082812  38886.883882  0.247915
min         2003.0            0.1         500.0       0.0
25%         2012.0            0.9       15000.0       0.0
50%         2014.0            3.6       32000.0       0.0
75%         2016.0            6.0       48767.0       0.0
max         2018.0           35.0      500000.0       3.0
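
By default `describe()` summarises only the numeric columns, which is why text columns such as Car_Name are absent above. The sketch below uses a small invented DataFrame (`toy`, with made-up column names) to show the `include="all"` option, which also reports `unique`, `top` and `freq` for non-numeric columns.

```python
import pandas as pd

# Hypothetical toy data, not taken from car_data.csv
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", "Petrol"],
    "price": [3.35, 4.75, 7.25, 2.85],
})

numeric_only = toy.describe()          # summarises the numeric column only
full = toy.describe(include="all")     # adds unique / top / freq for the text column

print(numeric_only)
print(full)
```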


In [33]:
print("The shape of data is : ",data.shape) #to see number of rows and columns in data

The shape of data is :  (301, 8)


In [34]:
print(data.columns) #to display all the columns in our data

Index(['Car_Name', 'Year', 'Selling_Price', 'Kms_Driven', 'Fuel_Type',
       'Seller_Type', 'Transmission', 'Owner'],
      dtype='object')


In [35]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Kms_Driven     301 non-null    int64  
 4   Fuel_Type      301 non-null    object 
 5   Seller_Type    301 non-null    object 
 6   Transmission   301 non-null    object 
 7   Owner          301 non-null    int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 18.9+ KB


In [36]:
data.isnull().sum()  # isnull() marks each missing value as True; sum() then counts the missing values per column

Car_Name         0
Year             0
Selling_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64
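
The car data has no missing values, so every count above is zero. To see non-zero counts, here is a minimal sketch on a made-up DataFrame (`toy`, with invented columns) that contains `NaN` and `None`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately missing entries
toy = pd.DataFrame({
    "kms": [27000, np.nan, 6900, np.nan],
    "fuel": ["Petrol", "Diesel", None, "Petrol"],
})

mask = toy.isnull()     # Boolean DataFrame: True where a value is missing
missing = mask.sum()    # True counts as 1, so this is the number missing per column
print(missing)
```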

In [37]:
data["Kms_Driven"]  # select a single column (attribute) from the DataFrame

0      27000
1      43000
2       6900
3       5200
4      42450
       ...  
296    33988
297    60000
298    87934
299     9000
300     5464
Name: Kms_Driven, Length: 301, dtype: int64
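
Indexing with a single label returns a Series, as above. The sketch below shows two closely related selections on a small hand-typed frame: a list of labels returns a DataFrame, and a Boolean condition selects rows. The names `toy`, `one_col`, `two_cols` and `low_kms` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical three-row frame, typed by hand
toy = pd.DataFrame({
    "name": ["ritz", "sx4", "ciaz"],
    "kms": [27000, 43000, 6900],
    "price": [3.35, 4.75, 7.25],
})

one_col = toy["kms"]                 # one label: a Series
two_cols = toy[["name", "price"]]    # list of labels: a DataFrame
low_kms = toy[toy["kms"] < 30000]    # Boolean mask: rows where the condition holds

print(type(one_col).__name__, type(two_cols).__name__)
print(low_kms)
```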

In [39]:
# The following function maps a price to "Low", "Normal" or "High"; it is applied to each element of a Series below.
def price_range(num):
    if num < 5:
        return "Low"
    elif num < 10:
        return "Normal"
    else:
        return "High"

In [40]:
# take the Selling_Price column
Selling_price = data["Selling_Price"]

In [41]:
Selling_price

0       3.35
1       4.75
2       7.25
3       2.85
4       4.60
       ...  
296     9.50
297     4.00
298     3.35
299    11.50
300     5.30
Name: Selling_Price, Length: 301, dtype: float64

In [42]:
# pass the function to apply() and store the returned Series in new
new = Selling_price.apply(price_range)

In [43]:
data['Price Category'] = new #create new column in the dataset

In [44]:
data

     Car_Name  Year  Selling_Price  Kms_Driven  Fuel_Type  Seller_Type  Transmission  Owner  Price Category
0        ritz  2014           3.35       27000     Petrol       Dealer        Manual      0             Low
1         sx4  2013           4.75       43000     Diesel       Dealer        Manual      0             Low
2        ciaz  2017           7.25        6900     Petrol       Dealer        Manual      0          Normal
3     wagon r  2011           2.85        5200     Petrol       Dealer        Manual      0             Low
4       swift  2014           4.60       42450     Diesel       Dealer        Manual      0             Low
...       ...   ...            ...         ...        ...          ...           ...    ...             ...
296      city  2016           9.50       33988     Diesel       Dealer        Manual      0          Normal
297      brio  2015           4.00       60000     Petrol       Dealer        Manual      0             Low
298      city  2009           3.35       87934     Petrol       Dealer        Manual      0             Low
299      city  2017          11.50        9000     Diesel       Dealer        Manual      0            High


In [45]:
# Check distribution of categorical column
data['Price Category'].value_counts()

Low       193
Normal     80
High       28
Name: Price Category, dtype: int64
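
The `apply`-based categorisation above calls a Python function once per element. As an alternative sketch, not the method used in this notebook, `pd.cut` can assign the same labels in one vectorised call. The series `prices` below is invented, and `by_apply` and `by_cut` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

prices = pd.Series([3.35, 4.75, 7.25, 11.50, 5.0, 10.0])  # hypothetical prices

def price_range(num):
    if num < 5:
        return "Low"
    elif num < 10:
        return "Normal"
    return "High"

by_apply = prices.apply(price_range)

# right=False makes each bin closed on the left: [-inf, 5), [5, 10), [10, inf)
by_cut = pd.cut(prices,
                bins=[-np.inf, 5, 10, np.inf],
                labels=["Low", "Normal", "High"],
                right=False)

print((by_apply == by_cut.astype(str)).all())   # True: both approaches agree
print(by_cut.value_counts(normalize=True))      # share of each category
```

Setting `right=False` matters here: with the default `right=True`, a price of exactly 5 would fall in the "Low" bin, which would not match `price_range`.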