# Python For Machine Learning
By Imen Masmoudi

## Import NumPy and pandas modules

Import the NumPy and pandas modules.

In [None]:
import numpy as np
import pandas as pd

# NumPy

**Call np.array to create a NumPy array with your own values. For example, np.array creates an 8-element vector if you give it an 8-element list.**

In [None]:
one_dimensional_array = np.array([1.2, 2.4, 3.5, 4.7, 6.1, 7.2, 8.3, 9.5])
print(one_dimensional_array)

In [None]:
[1,'i']

[1, 'i']

In [None]:
# You can also use np.array to create a two-dimensional matrix.
# To do so, just add an extra layer of square brackets.
# For example, the following call creates a 3x2 matrix:
two_dimensional_array = np.array([[6, 5], [11, 7], [4, 8]])
print(two_dimensional_array)
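
To confirm that the extra layer of brackets produced a 3x2 matrix, you can inspect the array's attributes. The following is a minimal sketch; the variable name `matrix` is only illustrative and holds the same values as the cell above.

```python
import numpy as np

# Same values as the 3x2 matrix created above.
matrix = np.array([[6, 5], [11, 7], [4, 8]])

print(matrix.shape)  # (number of rows, number of columns)
print(matrix.ndim)   # number of dimensions (2 for a matrix)
print(matrix.dtype)  # element type that NumPy inferred from the values
```

Checking `shape` before doing arithmetic on two arrays is a quick way to catch dimension mismatches early.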

## **To populate a matrix with all zeroes, call np.zeros. To populate a matrix with all ones, call np.ones.**

In [None]:
zeros=np.zeros((5,2))
zeros

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [None]:
ones=np.ones((5,2))
ones

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [None]:
#You can populate an array with a sequence of numbers:
sequence_of_integers = np.arange(5, 12)
print(sequence_of_integers)
# Note: np.arange includes the start value (5) but excludes the stop value (12).

[ 5  6  7  8  9 10 11]
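
Because the stop value is excluded, including 12 in the sequence means passing 13 as the stop. `np.arange` also accepts an optional third argument for the step size. A short sketch (the variable names are illustrative):

```python
import numpy as np

# The stop value is excluded, so use 13 to include 12.
five_to_twelve = np.arange(5, 13)

# The optional third argument sets the step between values.
every_other = np.arange(5, 12, 2)

print(five_to_twelve)
print(every_other)
```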


NumPy provides various functions to populate matrices with random numbers across certain ranges. For example, np.random.randint generates random integers between a low and high value. The following call populates a 6-element vector with random integers between 50 and 100:

In [None]:
random_integers_between_50_and_100 = np.random.randint(low=50, high=101, size=(6))
print(random_integers_between_50_and_100)
# Note that the highest integer np.random.randint can generate is one less than the high argument.

[52 75 54 90 75 80]
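
The values above change on every run. If you need repeatable results, one option is NumPy's `Generator` API with a fixed seed. This is an alternative to the `np.random.randint` call used in this notebook, not a replacement for it, and the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

# A seeded generator produces the same sequence on every run.
rng = np.random.default_rng(seed=42)

# As with np.random.randint, the high value is excluded by default,
# so high=101 yields integers from 50 to 100.
sample = rng.integers(low=50, high=101, size=6)

# A second generator with the same seed reproduces the same values.
same_sample = np.random.default_rng(seed=42).integers(low=50, high=101, size=6)

print(sample)
print(np.array_equal(sample, same_sample))
```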


In [None]:
#To create random floating-point values between 0.0 and 1.0, call np.random.random. For example:
random_floats_between_0_and_1 = np.random.random([6])
print(random_floats_between_0_and_1)

[0.08118665 0.32344449 0.53518196 0.70407442 0.25390511 0.98055868]


## Mathematical Operations on NumPy Operands
If you want to add or subtract two vectors or matrices, linear algebra requires that the two operands have the same dimensions.
Furthermore, if you want to multiply two vectors or matrices, linear algebra imposes strict rules on the dimensional compatibility of the operands. Fortunately, NumPy uses a trick called broadcasting to virtually expand the smaller operand so that its dimensions are compatible with those of the larger one. For example, the following operation uses broadcasting to add 2.0 to the value of every item in the vector created in the previous code cell:

In [None]:
# Broadcasting expands the scalar 2.0 so that it is added to each cell:
random_floats_between_2_and_3 = random_floats_between_0_and_1 + 2.0
print(random_floats_between_2_and_3)

[2.08118665 2.32344449 2.53518196 2.70407442 2.25390511 2.98055868]


In [None]:
#The following operation also relies on broadcasting to multiply each cell in a vector by 3:
random_integers_between_150_and_300 = random_integers_between_50_and_100 * 3
print(random_integers_between_150_and_300)

[156 225 162 270 225 240]
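
Broadcasting is not limited to scalars. The sketch below broadcasts a 3-element row across a 2x3 matrix, and also shows that `*` multiplies element by element, whereas the `@` operator performs the matrix product from linear algebra. The arrays and names are made up for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)

# The row is virtually repeated for each row of the matrix.
summed = matrix + row

# * multiplies matching elements; it is not a matrix product.
elementwise = matrix * row

# @ is the matrix product: (2, 3) @ (3,) gives shape (2,).
matrix_product = matrix @ row

print(summed)
print(elementwise)
print(matrix_product)
```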


## Let's Create a Linear Dataset
Our goal is to create a simple dataset consisting of a single feature and a label, as follows:

  1. Assign a sequence of integers from 6 to 20 (inclusive) to a NumPy array named `feature`.
  2. Assign 15 values to a NumPy array named `label` such that:

     label = (3)(feature) + 4

For example, the first value for label should be:

  label = (3)(6) + 4 = 22

In [None]:
feature = np.arange(6, 21)
print(feature)
label = (feature * 3) + 4
print(label)

[ 6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
[22 25 28 31 34 37 40 43 46 49 52 55 58 61 64]


## Add Some Noise to the Dataset
To make our dataset a little more realistic, we will insert a little random noise into each element of the label array we already created. To be more precise, we will modify each value assigned to label by adding a different random floating-point value between -2 and +2.

Don't rely on broadcasting. Instead, create a noise array having the same dimension as label.

In [None]:
noise = (np.random.random([15]) * 4) - 2
print(noise)
label = label + noise
print(label)

[ 0.22820583  1.91598334 -1.89498556  0.23462696  1.83502187  0.86337003
 -1.93186993  0.40543334  1.69955882 -0.46169434 -0.76125308 -1.43170009
  1.19807362  1.61614382  0.13909204]
[22.22820583 26.91598334 26.10501444 31.23462696 35.83502187 37.86337003
 38.06813007 43.40543334 47.69955882 48.53830566 51.23874692 53.56829991
 59.19807362 62.61614382 64.13909204]
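
The expression `(np.random.random([15]) * 4) - 2` works because `np.random.random` returns values in the range [0.0, 1.0): multiplying by 4 stretches that range to [0.0, 4.0), and subtracting 2 shifts it to [-2.0, 2.0). The following sketch rebuilds the dataset under separate, illustrative names and checks both the shape and the bounds of the noise.

```python
import numpy as np

# Rebuild the noise-free labels under a separate name.
clean_label = (np.arange(6, 21) * 3) + 4

# Scale [0.0, 1.0) to [0.0, 4.0), then shift to [-2.0, 2.0).
noise = (np.random.random(clean_label.shape) * 4) - 2
noisy_label = clean_label + noise

# The noise array has one value per label, so no broadcasting is needed.
print(noise.shape == clean_label.shape)

# Every noise value stays within the intended bounds.
print(noise.min() >= -2.0 and noise.max() < 2.0)
```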


# Pandas DataFrame

This part of the Colab introduces [**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which are the central data structure in the pandas API.

A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

  * A DataFrame stores data in cells.
  * A DataFrame has named columns (usually) and numbered rows.

## Creating a DataFrame

The following code creates a simple DataFrame containing 10 cells organized as follows:

  * 5 rows
  * 2 columns, one named `temperature` and the other named `activity`

This code instantiates the `pd.DataFrame` class to generate a DataFrame. Here, the constructor takes two arguments:

  * The first argument provides the data to populate the 10 cells. The code cell calls `np.array` to generate the 5x2 NumPy array.
  * The second argument identifies the names of the two columns.

In [None]:
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity
0            0         3
1           10         7
2           20         9
3           30        14
4           40        15
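
Once a DataFrame exists, a few attributes let you confirm its layout without printing every row. A minimal sketch, using a separate DataFrame named `example` (an illustrative name) built from the same values:

```python
import numpy as np
import pandas as pd

example = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['temperature', 'activity'],
)

print(example.shape)          # (rows, columns)
print(example.size)           # total number of cells
print(list(example.columns))  # column names
print(example.dtypes)         # data type of each column
```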


## Adding a new column to a DataFrame

You may add a new column to an existing pandas DataFrame just by assigning values to a new column name. For example, the following code creates a third column named `adjusted` in `my_dataframe`:

In [None]:
# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11
3           30        14        16
4           40        15        17


## Specifying a subset of a DataFrame

pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame.

In [None]:
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')

print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')

print("Column 'temperature':")
print(my_dataframe['temperature'])

Rows #0, #1, and #2:
   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11 

Row #2:
   temperature  activity  adjusted
2           20         9        11 

Rows #1, #2, and #3:
   temperature  activity  adjusted
1           10         7         9
2           20         9        11
3           30        14        16 

Column 'temperature':
0     0
1    10
2    20
3    30
4    40
Name: temperature, dtype: int64
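
Two more selection tools are worth knowing: `loc` selects by label and `iloc` selects by integer position. The sketch below uses a small stand-in DataFrame named `example` (illustrative) to read a single cell, filter rows with a condition, and show that a `loc` slice includes its end label, unlike the positional slice `[1:4]` used above.

```python
import pandas as pd

example = pd.DataFrame({
    'temperature': [0, 10, 20, 30, 40],
    'activity': [3, 7, 9, 14, 15],
})

# A single cell, by label and by integer position.
cell_by_label = example.loc[2, 'activity']
cell_by_position = example.iloc[2, 1]

# Rows that satisfy a condition (a boolean mask).
warm_rows = example[example['temperature'] >= 20]

# loc slices include the end label: rows 1, 2, and 3.
rows_one_to_three = example.loc[1:3]

print(cell_by_label, cell_by_position)
print(warm_rows)
print(rows_one_to_three)
```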


## Let's Create a DataFrame

We will do the following:

  1. Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named `Eleanor`, `Chidi`, `Tahani`, and `Jason`. Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.

  2. Output the following:

     * the entire DataFrame
     * the value in the cell of row #1 of the `Eleanor` column

  3. Create a fifth column named `Janet`, which is populated with the row-by-row sums of `Tahani` and `Jason`.

In [None]:
# Create a Python list that holds the names of the four columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a 3x4 numpy array, each cell populated with a random integer.
my_data = np.random.randint(low=0, high=101, size=(3, 4))

# Create a DataFrame.
df = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(df)

# Print the value in row #1 of the Eleanor column.
print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])

# Create a column named Janet whose contents are the sum
# of two other columns.
df['Janet'] = df['Tahani'] + df['Jason']

# Print the enhanced DataFrame
print(df)

   Eleanor  Chidi  Tahani  Jason
0       76     30      80      3
1        2     28      95      5
2       48     82      31     47

Second row of the Eleanor column: 2

   Eleanor  Chidi  Tahani  Jason  Janet
0       76     30      80      3     83
1        2     28      95      5    100
2       48     82      31     47     78
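
The same task can be written with `loc` for the cell lookup and `sum(axis=1)` for the row-by-row total. This is an alternative sketch rather than the approach used in the cell above; the name `people` and the seed value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# A seeded generator keeps this example repeatable.
rng = np.random.default_rng(seed=0)
people = pd.DataFrame(
    rng.integers(low=0, high=101, size=(3, 4)),
    columns=['Eleanor', 'Chidi', 'Tahani', 'Jason'],
)

# Read row #1 of the Eleanor column by label.
second_eleanor = people.loc[1, 'Eleanor']

# Sum the two selected columns across each row.
people['Janet'] = people[['Tahani', 'Jason']].sum(axis=1)

print(second_eleanor)
print(people)
```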


## Copying a DataFrame

Pandas provides two different ways to duplicate a DataFrame:

* **Referencing.** If you assign a DataFrame to a new variable, any change to the DataFrame or to the new variable will be reflected in the other.
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy.  Changes to the original DataFrame or to the copy will not be reflected in the other.

The difference is subtle, but important.

In [None]:
# Create a reference by assigning df to a new variable.
print("Experiment with a reference:")
reference_to_df = df

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % df['Jason'][1])
print("  Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

# Modify a cell in df.
df.at[1, 'Jason'] = df['Jason'][1] + 5
print("  Updated df: %d" % df['Jason'][1])
print("  Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])

# Create a true copy of my_dataframe
print("Experiment with a true copy:")
copy_of_my_dataframe = my_dataframe.copy()

# Print the starting value of a particular cell.
print("  Starting value of my_dataframe: %d" % my_dataframe['activity'][1])
print("  Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['activity'][1])

# Modify a cell in my_dataframe.
my_dataframe.at[1, 'activity'] = my_dataframe['activity'][1] + 3
print("  Updated my_dataframe: %d" % my_dataframe['activity'][1])
print("  copy_of_my_dataframe does not get updated: %d" % copy_of_my_dataframe['activity'][1])

Experiment with a reference:
  Starting value of df: 5
  Starting value of reference_to_df: 5

  Updated df: 10
  Updated reference_to_df: 10


Experiment with a true copy:
  Starting value of my_dataframe: 7
  Starting value of copy_of_my_dataframe: 7

  Updated my_dataframe: 10
  copy_of_my_dataframe does not get updated: 7
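
A quick way to tell the two cases apart is Python's `is` operator, which reports whether two names point to the same object. The sketch below uses a small throwaway DataFrame; the names `original`, `alias`, and `independent` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({'activity': [3, 7, 9]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

# Change one cell through the original name.
original.at[1, 'activity'] = 10

print(alias is original)                  # True
print(independent is original)            # False
print(alias.at[1, 'activity'])            # 10
print(independent.at[1, 'activity'])      # 7
```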


## From List to DataFrame to CSV

In [None]:
# List with pairs
data=[[i,j] for i,j in zip([0,1,2,3] , [4,5,6,7])]

# creating a Dataframe object
data=pd.DataFrame(data,columns=["X1","X2"])

data

   X1  X2
0   0   4
1   1   5
2   2   6
3   3   7


In [None]:
# dictionary with list object in values
details = {
    'Name' : ['Ankit', 'Aishwarya', 'Shaurya', 'Shivangi'],
    'Age' : [23, 21, 22, 21],
    'University' : ['BHU', 'JNU', 'DU', 'BHU'],
}

# creating a Dataframe object
df = pd.DataFrame(details)

df

        Name  Age University
0      Ankit   23        BHU
1  Aishwarya   21        JNU
2    Shaurya   22         DU
3   Shivangi   21        BHU


In [None]:
# Creating a csv file from a dataframe

data.to_csv('file_name1.csv')

df.to_csv('file_name2.csv')
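
By default, `to_csv` also writes the row index as a first column with no header. When such a file is read back with `pd.read_csv`, that column shows up under the name `Unnamed: 0`. Passing `index=False` leaves the index out. The sketch below demonstrates the difference in memory with `io.StringIO` instead of writing files; the name `pairs` is illustrative.

```python
import io
import pandas as pd

pairs = pd.DataFrame({'X1': [0, 1, 2, 3], 'X2': [4, 5, 6, 7]})

# With no path argument, to_csv returns the CSV text as a string.
csv_with_index = pairs.to_csv()
csv_without_index = pairs.to_csv(index=False)

# Read both versions back to compare the resulting columns.
reloaded_with_index = pd.read_csv(io.StringIO(csv_with_index))
reloaded_clean = pd.read_csv(io.StringIO(csv_without_index))

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'X1', 'X2']
print(list(reloaded_clean.columns))       # ['X1', 'X2']
```

Whether to keep the index depends on the data: for a default 0, 1, 2, ... index it usually carries no information, so `index=False` keeps the file cleaner.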

# This is just the beginning!
Let's GO!

https://colab.research.google.com/drive/1M9c4BVLIW1pXlMUYkzsYyPr8zyQA3QwA?usp=sharing