## Python Virtual Environment Setup

### Step 1.
> To create a new virtual environment, open your Anaconda Prompt, which starts in the default "base" environment. Then run the following commands (substitute "myenv" with the virtual environment name of your choosing):

**conda create -n myenv python**<br/>
**conda activate myenv**

> In File Explorer, go to the folder "C:\Users\<UserName>\anaconda3\envs". There you will find the newly created folder "myenv".

> Note that you have to include "python" in the create command; otherwise your virtual environment is created as an essentially empty folder, without Python or basic packages such as pip pre-installed. To confirm the environment was set up properly, run conda list to see which packages are already in it:

**conda list**

> Now check the Python version with the following command:

**python --version**
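
> The same checks can be made from inside Python, which is useful later in a notebook. The sketch below uses only the standard library. `describe_environment` is a hypothetical helper name, and it assumes conda has set the `CONDA_DEFAULT_ENV` variable on activation (it may be missing if Python was started some other way).

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter that is currently running."""
    return {
        # Full path of the running python executable
        "executable": sys.executable,
        # Root folder of the environment, e.g. ...\anaconda3\envs\myenv
        "prefix": sys.prefix,
        # Name of the active conda environment, or None if conda did not set it
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Same information as `python --version`
        "version": ".".join(str(part) for part in sys.version_info[:3]),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

> If "prefix" ends in "envs\myenv", the commands above activated the intended environment.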

> At this point, you can use conda or pip (Python's standard package installer) to install whichever packages you need in your newly activated virtual environment:

**conda install numpy**<br/>
**conda install pandas**<br/>
**conda install matplotlib**<br/>
**conda install scikit-learn**

> All of these packages are installed in the folder "C:\Users\<UserName>\anaconda3\envs\myenv\Lib\site-packages".
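
> To confirm that location from Python rather than from the file manager, something like the following may help. It is a minimal sketch using the standard library. `site_packages_dir` and `is_installed` are hypothetical helper names, and note that scikit-learn is imported under the module name `sklearn`.

```python
import sysconfig
from importlib import util


def site_packages_dir():
    """Return the folder where pure-Python packages are installed."""
    return sysconfig.get_paths()["purelib"]


def is_installed(module_name):
    """Return True if `module_name` can be imported from this environment."""
    return util.find_spec(module_name) is not None


print("site-packages:", site_packages_dir())
for name in ("numpy", "pandas", "matplotlib", "sklearn"):
    print(f"{name}: {'found' if is_installed(name) else 'missing'}")
```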

### Step 2.
> When you run your project in Jupyter Notebook, you need a way to reference this new virtual environment instead of your base environment. In Anaconda Prompt (with your virtual environment still activated), run the following command. Keep in mind, you’ll have to run this in each virtual environment you create.

**conda install jupyter**

### Step 3.

> Then, in your base environment you need to install nb_conda_kernels (just one time), which automatically creates a new kernel for each virtual environment you create, provided you’ve installed jupyter in it.

**conda install -n base nb_conda_kernels**

### Step 4.

> Your Jupyter Notebooks can now run on either kernel (base or myenv), and therefore pull the correct packages/versions depending on the project at hand. You can launch Jupyter Notebook from within any activated environment with the command below, and it will open up your Notebooks location (mine are in C:/Users/myusername, which is where my Anaconda3 is also installed).

**jupyter notebook**

> With a new Jupyter notebook open, you can click Kernel > Change kernel > and select the virtual environment you need.

![VirtualEnvironment.png](attachment:VirtualEnvironment.png)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
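
Because each kernel pulls packages from its own environment, it can be worth checking which versions the selected kernel actually sees. The following is a sketch under the assumption that Python 3.8 or later is in use (for `importlib.metadata`). `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names as used by conda/pip (not the import names)
for dist in ("numpy", "pandas", "matplotlib", "scikit-learn"):
    print(dist, installed_version(dist))
```

Running the same cell after switching kernels should show whether the two environments hold different versions.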

Windows 64-bit packages of scikit-learn can be accelerated using scikit-learn-intelex.<br/>
More details are available here: https://intel.github.io/scikit-learn-intelex

For example:<br/>
**conda install scikit-learn-intelex** <br/>
**python -m sklearnex my_application.py**
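
Inside a notebook, where there is no script to launch with `python -m sklearnex`, the extension also offers a patching call. The sketch below assumes scikit-learn-intelex may or may not be installed and falls back to plain scikit-learn if it is not. `enable_sklearn_acceleration` is a hypothetical helper name.

```python
def enable_sklearn_acceleration():
    """Patch scikit-learn in place if scikit-learn-intelex is installed."""
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        return False  # extension not installed; plain scikit-learn is used
    patch_sklearn()
    return True


accelerated = enable_sklearn_acceleration()
print("scikit-learn-intelex active:", accelerated)
# Import estimators only after patching, e.g. `from sklearn.cluster import KMeans`
```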

### Step 5. (For PySpark Installation)

> To install PySpark, issue the following two commands at the Anaconda prompt:

**pip install findspark**<br/>
**pip install pyspark**

> For Google Colab issue the following command in a Colab cell:

**! pip install pyspark**
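
> If one notebook is shared between Jupyter and Colab, it may help to detect where it is running before deciding whether the install cell is needed. This is a sketch that relies on the common assumption that Colab preloads the `google.colab` package. `running_in_colab` is a hypothetical helper name.

```python
import sys


def running_in_colab():
    """Return True when the code is executing inside Google Colab."""
    # Assumption: Colab preloads the `google.colab` package in its runtimes.
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: run `!pip install pyspark` in a cell first.")
else:
    print("Not running in Colab.")
```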

### Run PySpark on Jupyter Lab or Jupyter Notebook

In [4]:
import findspark
findspark.init()
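
`findspark.init()` locates a Spark installation and makes `pyspark` importable. When PySpark was installed with pip into the active environment, it is usually importable already. The sketch below checks for that first; `ensure_pyspark` and its return strings are hypothetical, and the `findspark.init()` call may fail if no Spark installation can be located.

```python
from importlib import util


def ensure_pyspark():
    """Make `pyspark` importable and report how it was found."""
    if util.find_spec("pyspark") is not None:
        return "already importable"
    if util.find_spec("findspark") is None:
        return "missing"  # neither package is installed in this environment
    import findspark

    findspark.init()  # searches for Spark and extends sys.path
    return "located by findspark"


print("pyspark:", ensure_pyspark())
```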

In [7]:
from pyspark.sql import SparkSession
# local[1] runs Spark locally with one worker thread, i.e. a single CPU core
spark = SparkSession.builder.master("local[1]") \
                    .appName('PySparkExamples') \
                    .getOrCreate()

data = [("Amal", "Das", "India", "West Bengal"), ("Kamal", "Singh", "India", "Hyderabad"),
        ("Rabi", "Narayanan", "India", "Tamilnadu"), ("Maria", "Jones", "India", "Karnataka")]
columns = ["Firstname", "Lastname", "Country", "State"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
print(df.collect(), "\n")
print(df.schema)

+---------+---------+-------+-----------+
|Firstname| Lastname|Country|      State|
+---------+---------+-------+-----------+
|     Amal|      Das|  India|West Bengal|
|    Kamal|    Singh|  India|  Hyderabad|
|     Rabi|Narayanan|  India|  Tamilnadu|
|    Maria|    Jones|  India|  Karnataka|
+---------+---------+-------+-----------+

[Row(Firstname='Amal', Lastname='Das', Country='India', State='West Bengal'), Row(Firstname='Kamal', Lastname='Singh', Country='India', State='Hyderabad'), Row(Firstname='Rabi', Lastname='Narayanan', Country='India', State='Tamilnadu'), Row(Firstname='Maria', Lastname='Jones', Country='India', State='Karnataka')] 

StructType([StructField('Firstname', StringType(), True), StructField('Lastname', StringType(), True), StructField('Country', StringType(), True), StructField('State', StringType(), True)])
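
When `schema` is a plain list of names, `createDataFrame` assigns those names to the tuple fields by position, which is why each `Row` above carries `Firstname`, `Lastname`, `Country`, and `State`. The pairing can be illustrated without Spark; the sketch below uses two of the rows, and `records` is an illustrative name only.

```python
data = [("Amal", "Das", "India", "West Bengal"),
        ("Kamal", "Singh", "India", "Hyderabad")]
columns = ["Firstname", "Lastname", "Country", "State"]

# Pair each column name with the value at the same position in the tuple.
records = [dict(zip(columns, row)) for row in data]

for record in records:
    print(record)
```

Reordering `columns` without reordering the tuples would therefore label the values incorrectly.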


### Run PySpark on Google Colab

Open Google Colab, create a notebook, and type this command in the first cell:<br/>
**! pip install pyspark**
> Now type the following code in the second cell and execute it.

In [7]:
from pyspark.sql import SparkSession
# local[1] runs Spark locally with one worker thread, i.e. a single CPU core
spark = SparkSession.builder.master("local[1]") \
                    .appName('PySparkExamples') \
                    .getOrCreate()

data = [("Amal", "Das", "India", "West Bengal"), ("Kamal", "Singh", "India", "Hyderabad"),
        ("Rabi", "Narayanan", "India", "Tamilnadu"), ("Maria", "Jones", "India", "Karnataka")]
columns = ["Firstname", "Lastname", "Country", "State"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
print(df.collect(), "\n")
print(df.schema)

+---------+---------+-------+-----------+
|Firstname| Lastname|Country|      State|
+---------+---------+-------+-----------+
|     Amal|      Das|  India|West Bengal|
|    Kamal|    Singh|  India|  Hyderabad|
|     Rabi|Narayanan|  India|  Tamilnadu|
|    Maria|    Jones|  India|  Karnataka|
+---------+---------+-------+-----------+

[Row(Firstname='Amal', Lastname='Das', Country='India', State='West Bengal'), Row(Firstname='Kamal', Lastname='Singh', Country='India', State='Hyderabad'), Row(Firstname='Rabi', Lastname='Narayanan', Country='India', State='Tamilnadu'), Row(Firstname='Maria', Lastname='Jones', Country='India', State='Karnataka')] 

StructType([StructField('Firstname', StringType(), True), StructField('Lastname', StringType(), True), StructField('Country', StringType(), True), StructField('State', StringType(), True)])


### Run PySpark on Databricks

Search for "tempmail" on Google and open "https://temp-mail.org/en/" to get a temporary e-mail address.<br/>
Search for "databricks community login" on Google and open "https://community.cloud.databricks.com/".<br/>
> Fill in the registration form with all required details. Select the **"Get started with Community Edition"** option as shown in the screenshot. Verify your e-mail address, reset your password, and start working.

![DataBricks-1-2.png](attachment:DataBricks-1-2.png)

> * Go to the "Compute" menu and create a cluster.<br/>
> * Go to "Settings" => "Admin Console" => "Advanced" section => "Workspace settings" => enable "DBFS File Browser".<br/>
> * Open a workspace, create a notebook, type the following code, and run it with "Ctrl + Enter".

In [7]:
from pyspark.sql import SparkSession
# local[x] runs Spark in local mode with x worker threads. x must be an integer
# greater than 0 and is ideally the number of CPU cores you have. It also sets the
# default number of partitions for RDDs (Resilient Distributed Datasets),
# DataFrames, and Datasets. local[1] therefore uses a single core.
# Note: Databricks notebooks already provide a `spark` session, which getOrCreate() returns.
spark = SparkSession.builder.master("local[1]").appName('PySparkExamples').getOrCreate()
data = [("Amal", "Das", "India", "West Bengal"), ("Kamal", "Singh", "India", "Hyderabad"),
        ("Rabi", "Narayanan", "India", "Tamilnadu"), ("Maria", "Jones", "India", "Karnataka")]
columns = ["Firstname", "Lastname", "Country", "State"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
print(df.collect(), "\n")
print(df.schema)

+---------+---------+-------+-----------+
|Firstname| Lastname|Country|      State|
+---------+---------+-------+-----------+
|     Amal|      Das|  India|West Bengal|
|    Kamal|    Singh|  India|  Hyderabad|
|     Rabi|Narayanan|  India|  Tamilnadu|
|    Maria|    Jones|  India|  Karnataka|
+---------+---------+-------+-----------+

[Row(Firstname='Amal', Lastname='Das', Country='India', State='West Bengal'), Row(Firstname='Kamal', Lastname='Singh', Country='India', State='Hyderabad'), Row(Firstname='Rabi', Lastname='Narayanan', Country='India', State='Tamilnadu'), Row(Firstname='Maria', Lastname='Jones', Country='India', State='Karnataka')] 

StructType([StructField('Firstname', StringType(), True), StructField('Lastname', StringType(), True), StructField('Country', StringType(), True), StructField('State', StringType(), True)])
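
The `local[x]` master string used in these cells can be built from the machine's core count instead of being hard-coded. This is mainly relevant to local Jupyter runs, since a Databricks cluster already supplies its own `spark` session. The sketch below uses only the standard library, and `local_master` is a hypothetical helper name.

```python
import os


def local_master(cores=None):
    """Build a Spark master URL for local mode.

    cores=None uses every core reported by the OS; Spark also accepts
    "local[*]" for the same purpose.
    """
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


print(local_master(1))
print(local_master())
```

The result can be passed to `SparkSession.builder.master(...)` in place of the literal `"local[1]"`.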


**Thus we can run PySpark in Jupyter Lab or Jupyter Notebook, Google Colab Cloud and Databricks Cloud.**

### Databricks File System (DBFS) commands:
**To list files in DBFS**<br>
%fs ls<br>
%fs ls dbfs:/FileStore/tables<br>

**To rename a file**<br>
%fs mv dbfs:/FileStore/tables/old_file_name dbfs:/FileStore/tables/new_file_name

**To delete a file from DBFS**<br>
dbutils.fs.rm("dbfs:/FileStore/tables/your_file_name")<br>
%fs rm dbfs:/FileStore/tables/your_file_name

**To delete a folder from DBFS**<br>
%fs rm -r dbfs:/FileStore/tables/your_folder_name

**To copy a file in DBFS**<br>
%fs cp dbfs:/FileStore/tables/old_file_name dbfs:/FileStore/tables/new_file_name

**To move a file to another folder in DBFS**<br>
%fs mv dbfs:/FileStore/tables/old_file_name dbfs:/FileStore/tables/target_folder

**To create a folder in DBFS**<br>
%fs mkdirs dbfs:/FileStore/tables/mydir
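
The `%fs` magics above have `dbutils.fs` counterparts that can be called from Python, which is convenient when several steps need to be chained. The sketch below is illustrative only: `organize_tables` is a hypothetical helper, and its `fs` argument is assumed to be `dbutils.fs`, which is only available inside a Databricks notebook.

```python
BASE = "dbfs:/FileStore/tables"


def organize_tables(fs, file_name, folder_name="mydir"):
    """Copy `file_name` into a folder under BASE and list that folder.

    `fs` is expected to be `dbutils.fs` from a Databricks notebook.
    """
    target_dir = f"{BASE}/{folder_name}"
    fs.mkdirs(target_dir)  # same as: %fs mkdirs
    fs.cp(f"{BASE}/{file_name}", f"{target_dir}/{file_name}")  # same as: %fs cp
    return [info.name for info in fs.ls(target_dir)]  # same as: %fs ls


# In a Databricks notebook: organize_tables(dbutils.fs, "your_file_name")
```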