# Assignment 1 Subtask 1

## File Types

#### 1. CSV
    a. A CSV (Comma-Separated Values) file is a plain text file that stores tabular data
    (numbers and text). Each line represents a row of data, and the fields (columns) in a
    row are separated by a delimiter, commonly a comma.
    b. CSV files are not designed for efficient random access. Rows have variable length,
    so reaching data at a specific location generally means reading the file sequentially
    until we reach the desired position.
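
The sequential access described above can be sketched with the standard `csv` module. This is a minimal illustration, not part of the assignment code; the helper name `read_row` is hypothetical, and the sample rows are taken from the SBIN output shown later in this document.

```python
import csv
import io
from itertools import islice

def read_row(csv_text, row_number):
    """Return data row `row_number` (0-based, header excluded) by scanning from the top."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first line holds the column names
    # islice consumes and discards every row before the requested one
    row = next(islice(reader, row_number, None))
    return dict(zip(header, row))

sample = "DATE,CLOSE\n05/12/2022,617.30\n06/12/2022,608.95\n07/12/2022,607.05\n"
third = read_row(sample, 2)
print(third)
```

Every value comes back as a string, so numeric columns still need converting after the row is found.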
#### 2. TXT
    a. A "txt" file is a plain text file without a specific structure or
       standardized format.
#### 3. Pickle
    a. The pickle module allows us to serialize and deserialize Python objects,
       storing them in a binary format.
#### 4. Parquet
    a. Parquet is a columnar storage file format commonly used with big data processing
       frameworks like Apache Spark and Apache Hive.
#### 5. HDF
    a. HDF5 files have a hierarchical structure of groups and datasets, similar to folders
       and files, which allows complex data to be organized efficiently. This structure
       enables quick access to specific datasets or groups within the file.
#### 6. Feather
    a. Like Parquet, Feather uses a columnar storage format, where data from the same
       column is stored together. This facilitates fast and efficient access to specific columns,
       making it suitable for analytical queries.
#### 7. JSON
    a. JSON (JavaScript Object Notation) is a lightweight data-interchange format that is
       easy for humans to read and write and for machines to parse and generate.
#### 8. AVRO
    a. Apache Avro is a binary serialization format developed within the Apache Hadoop
       project. It is compact, fast, and designed for efficient data serialization.

## Read and Write Time Analysis

#### 1. CSV

    a. Reading and writing CSV files is simple because data can be processed
    sequentially, line by line. Both operations are linear in the size of the file.
    b. Reading a CSV file involves parsing each line and splitting it on the delimiter.
    Every value arrives as text and must be converted to its numeric type, which makes
    CSV slower to load than binary formats for large datasets.
    c. Writing to a CSV file is also straightforward, as lines can be appended to the file
    one at a time.

#### 2. TXT

    a. Reading and writing operations in plain text files are generally efficient, since
    they involve only the sequential processing of lines.
    b. Reading a text file typically involves reading each line one at a time, making it a
    linear operation.
    c. Writing to a text file is straightforward, as data can be appended sequentially.
    Modifying data in the middle of a file generally requires rewriting everything after it.

#### 3. Pickle

    a. Pickling (serialization) and unpickling (deserialization) operations can be
    time-efficient for complex data structures and objects, because no text parsing or
    manual reconstruction is needed.
    b. Pickle files are binary, and the serialization process captures the internal structure of
    Python objects, including their state.
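
A minimal sketch of a pickle round trip, using only the standard library. The `position` dictionary is made-up example data; it mixes a `date`, a tuple, and a set, all of which come back with their original types.

```python
import pickle
from datetime import date

# Hypothetical example object, not taken from the assignment data
position = {
    "symbol": "SBIN",
    "opened": date(2022, 12, 5),
    "levels": (607.55, 618.00),
    "tags": {"bank", "nse"},
}

blob = pickle.dumps(position)   # serialize to a bytes object
restored = pickle.loads(blob)   # rebuild an equal, independent object

print(type(blob).__name__, restored == position)
```

Pickle should only be used with trusted data, since unpickling can execute arbitrary code.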

#### 4. Parquet

    a. Reading and processing specific columns is faster than with row-based storage
    formats like CSV or plain text, because only the requested columns need to be read.

#### 5. HDF

    a. HDF5 supports chunking, a mechanism where data is stored in fixed-size chunks. This
    can improve read and write performance, especially when working with large
    datasets, as it allows for selective access to specific parts of the data without reading
    the entire file.
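
The idea of selective access through chunks can be illustrated without any HDF5 library. The sketch below is plain Python arithmetic, not the HDF5 API: the hypothetical helper `chunks_for_rows` works out which fixed-size chunks would have to be read to serve a range of rows.

```python
def chunks_for_rows(start, stop, chunk_rows):
    """Return the indices of the chunks that hold rows start..stop-1."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    first = start // chunk_rows
    last = (stop - 1) // chunk_rows
    return list(range(first, last + 1))

# Assumed layout: 1,000,000 rows stored in chunks of 10,000 rows (100 chunks).
needed = chunks_for_rows(25_000, 31_000, chunk_rows=10_000)
print(needed)
```

Under this assumed layout, only two of the 100 chunks are touched to read 6,000 rows.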

#### 6. Feather

    a. Feather is designed to be a lightweight and fast serialization format. The binary
    format allows for rapid serialization and deserialization of data, contributing to
    efficient read and write operations.

#### 7. JSON

    a. Parsing JSON is a linear operation, making it efficient for reading and writing small to
    moderately sized datasets. Standard parsers typically load the whole document into
    memory, so very large files become costly.

#### 8. AVRO

    a. Avro uses binary encoding, contributing to faster serialization and deserialization
    than text-based formats like JSON. This is especially beneficial for large datasets and
    high-throughput scenarios.

## Space and Storage Analysis

#### 1. CSV

    a. CSV files carry little structural overhead: the format stores only the values, the
    delimiters, and an optional header row.
    b. Compared to binary formats, CSV files may still occupy more space, because numbers
    are stored as human-readable text and fields need delimiters and sometimes quotes.

#### 2. TXT

    a. Like CSV files, plain text files are not optimized for efficient random access. If random
    access is required, reading the file sequentially may be necessary.
    b. Plain text files are generally space-efficient because they store data in a simple,
    human-readable format without additional formatting overhead.
    c. However, the lack of a standardized structure means that the space efficiency can
    vary based on how the data is organized within the file.

#### 3. Pickle

    a. Pickle files are not designed for random access. A pickle stream is read
    sequentially, so loading one object from a file that holds several generally means
    unpickling the objects stored before it.
    b. Pickle files can be more space-efficient than plain text files because the binary
    format is more compact.
    c. The serialized format includes information about the object's structure, allowing for
    efficient representation of complex data types.
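
A minimal sketch of how several objects written to one pickle stream are read back: each `pickle.load` call returns the next object, in the order written. An in-memory `BytesIO` buffer stands in for a file here.

```python
import io
import pickle

buffer = io.BytesIO()
for obj in ({"symbol": "SBIN"}, [617.30, 608.95], "done"):
    pickle.dump(obj, buffer)   # each call appends one pickled object

buffer.seek(0)
loaded = []
while True:
    try:
        loaded.append(pickle.load(buffer))   # returns the next object in the stream
    except EOFError:
        break   # raised once the stream is exhausted

print(loaded)
```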

#### 4. Parquet

    a. Parquet files store data in a columnar format, meaning values from the same column
    are stored together. This can lead to significant performance improvements,
    especially for analytics and queries that select specific columns.
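    b. The columnar storage format of Parquet files contributes to space efficiency: similar
    values sit next to each other, so encodings such as dictionary and run-length
    encoding, plus compression, reduce the storage required for repeated values within
    a column.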
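
Why storing a column together helps with repeated values can be shown with a toy run-length encoder. This is an illustration of the general idea only, not Parquet's actual on-disk encoding; the column contents are made up.

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]

def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical column of 12 values with long runs of repeats
series_column = ["EQ"] * 6 + ["BE"] * 2 + ["EQ"] * 4
encoded = run_length_encode(series_column)
print(encoded)
```

Twelve stored values shrink to three pairs. In a row-based layout the same values would be interleaved with other fields, so the runs would not exist.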

#### 5. HDF

    a. The chunking mechanism in HDF5 improves time efficiency and can contribute to
    space efficiency: chunked datasets can be compressed chunk by chunk, which allows
    large datasets to be stored efficiently.
    b. HDF5 allows the creation of virtual datasets, defined as mappings onto data stored
    in other datasets, in the same file or in other files. This feature supports efficient
    storage and organization of data without duplication.

#### 6. Feather

    a. Because Feather stores each column contiguously, a subset of columns can be read
    without loading the rest of the table.
    b. The columnar layout keeps similar values together, which compresses well. Files in
    the newer Feather version 2 format can optionally be compressed with LZ4 or ZSTD;
    the original format stores data uncompressed.

#### 7. JSON

    a. JSON is relatively quick to serialize (convert objects to JSON format) and deserialize
    (convert JSON format to objects) due to its simple and text-based structure.
    b. JSON is a text-based format, so it may not be as space-efficient as binary formats.
    Text-based formats tend to have more overhead due to the inclusion of characters
    like curly braces, colons, and quotes.
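
The overhead of a text format can be made concrete with the standard library. This sketch encodes two records (values from the SBIN output later in this document) once as JSON and once as raw 8-byte doubles using `struct`; the binary layout is an assumption chosen for illustration, not any particular file format.

```python
import json
import struct

records = [
    {"close": 617.30, "high": 618.00},
    {"close": 608.95, "high": 619.80},
]

as_json = json.dumps(records).encode("utf-8")
# Two little-endian 8-byte doubles per record: no field names, no punctuation
as_binary = b"".join(struct.pack("<dd", r["close"], r["high"]) for r in records)

print(len(as_json), len(as_binary))
```

The JSON version repeats the field names and punctuation for every record, which is where most of the extra bytes go.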

#### 8. AVRO

    a. Avro's binary format is designed to be compact, resulting in smaller file sizes
    compared to text-based formats. The compactness is achieved through efficient
    encoding of data types and by storing the schema once in the file header rather than
    repeating field names in every record.
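
As an example of such an encoding, Avro writes `int` and `long` values with variable-length zig-zag coding, so small numbers take fewer bytes. The function below is a minimal sketch of that scheme written for illustration; it is not code from an Avro library, and the name `encode_long` is hypothetical.

```python
def encode_long(n):
    """Encode a signed 64-bit integer as a zig-zag variable-length byte string."""
    z = (n << 1) ^ (n >> 63)   # zig-zag: small magnitudes map to small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)   # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)                       # final byte, continuation bit clear
    return bytes(out)

trades = 210728   # NO OF TRADES for 05/12/2022 in the SBIN data
print(len(encode_long(trades)), len(str(trades)))
```

The same number takes more characters as text than it takes bytes in this encoding.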


## Graph

![](./media/SBIN.png)

### RISHIT JAKHARIA

In [3]:
import sys
from datetime import date
import pandas as pd
import numpy as np
from jugaad_data.nse import stock_df
import seaborn as sns
import ta
import matplotlib.pyplot as plt

In [3]:
def graph(macd_line, signal_line):
    # Set context and style in one call; set_theme() on its own would reset
    # a style chosen earlier with set_style().
    sns.set_theme(context='paper', style='dark')
    plt.figure(figsize=(12, 6))
    sns.lineplot(x=macd_line.index, y=macd_line, label='macd_line (SBIN)')
    sns.lineplot(x=signal_line.index, y=signal_line, label='signal_line (SBIN)')
    # Conventional MACD histogram: MACD line minus signal line, green when positive.
    diff = macd_line - signal_line
    plt.bar(diff.index, diff, color=['green' if val > 0 else 'red' for val in diff], width=1)
    plt.show()
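
The notebook does not show how the two inputs to `graph()` were produced. A minimal sketch of the conventional MACD definition, using pandas exponential moving averages, is given below. The 12/26/9 spans are the commonly used defaults and are an assumption here, as is the helper name `macd`; the prices are the first six SBIN closes from the output further down.

```python
import pandas as pd

def macd(close, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line) computed from a Series of closing prices."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow                                   # fast EMA minus slow EMA
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()     # EMA of the MACD line
    return macd_line, signal_line

close = pd.Series([617.30, 608.95, 607.05, 611.65, 616.50, 613.05])
macd_line, signal_line = macd(close)
print(macd_line.round(3).tolist())
```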

In [4]:
def linear(test_data, data):
    # Build the feature table: every predictor comes from the previous trading day,
    # except the current day's OPEN. test_data is currently unused.
    df = pd.DataFrame()
    df['P. CLOSE'] = data['PREV. CLOSE']
    df['P. OPEN'] = data['OPEN'].shift(1)
    df['P. VWAP'] = data['VWAP'].shift(1)
    df['P. LOW'] = data['LOW'].shift(1)
    df['P. HIGH'] = data['HIGH'].shift(1)
    df['P. NO OF TRADES'] = data['NO OF TRADES'].shift(1)
    df['OPEN'] = data['OPEN']
    df['ONES'] = 1  # intercept column
    df = df.drop(index=0)  # the first row has no previous day
    matrix = np.array(df, dtype=float)

    # Rotate the columns so that the intercept column comes first.
    x_matrix = np.roll(matrix, shift=1, axis=1)
    y_matrix = np.array(data['CLOSE'].drop(index=0), dtype=float)

    # Ordinary least squares via the normal equations: (X^T X) params = X^T y.
    # Solving the system avoids forming the explicit inverse.
    x_transpose = x_matrix.transpose()
    params = np.linalg.solve(x_transpose.dot(x_matrix), x_transpose.dot(y_matrix))

    print(params)
    return params
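
The normal-equation fit used in `linear()` can be checked on synthetic data where the true coefficients are known. This is a small sketch with made-up, noise-free data, not the stock data; it also shows `np.linalg.lstsq`, which solves the same problem and copes better with nearly collinear columns.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
x = np.column_stack([np.ones(len(features)), features])   # intercept column first
true_params = np.array([1.5, -2.0, 0.5])                   # made-up coefficients
y = x @ true_params                                        # noise-free targets

# Normal equations: (X^T X) params = X^T y
params = np.linalg.solve(x.T @ x, x.T @ y)

# The same fit through a least-squares solver
params_lstsq, *_ = np.linalg.lstsq(x, y, rcond=None)

print(params, params_lstsq)
```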

In [31]:
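def pairs(df1, df2):
    # Spread between the two closing prices and its 20-day rolling statistics.
    spread = df1['CLOSE'] - df2['CLOSE']
    roll = spread.rolling(window=20).mean()
    # Population standard deviation (ddof=0), equivalent to sqrt(E[x^2] - E[x]^2).
    sd = spread.rolling(window=20).std(ddof=0)
    z_score = np.array((spread - roll) / sd)

    stock = 0      # net position in the spread, kept within [-5, 5]
    cashflow = 0
    for i, score in enumerate(z_score):
        # NaN scores (the first 19 rows) fail both comparisons, so no trade happens.
        if score > 2 and stock > -5:
            # Spread is unusually high: sell df1, buy df2.
            stock -= 1
            cashflow += df1['CLOSE'].iloc[i]
            cashflow -= df2['CLOSE'].iloc[i]
        elif score < -2 and stock < 5:
            # Spread is unusually low: buy df1, sell df2.
            stock += 1
            cashflow -= df1['CLOSE'].iloc[i]
            cashflow += df2['CLOSE'].iloc[i]
        print(cashflow)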
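
The rolling z-score behind `pairs()` can be worked through on a tiny series. This sketch uses a window of 3 instead of 20 and made-up spread values so the numbers are easy to verify by hand; it also checks that the sum-of-squares form of the variance matches the population standard deviation.

```python
import numpy as np
import pandas as pd

spread = pd.Series([1.0, 2.0, 3.0, 10.0])   # made-up spread values
window = 3

mean = spread.rolling(window).mean()
std = spread.rolling(window).std(ddof=0)     # population standard deviation
z_score = (spread - mean) / std

# Same quantity from E[x^2] - E[x]^2
std_manual = np.sqrt((spread ** 2).rolling(window).mean() - mean ** 2)

print(z_score.round(4).tolist())
```

The first `window - 1` entries are NaN because the window is not yet full, so no signal can be generated for those rows.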

In [32]:
# ------------------------------------ generation of data frame --------------------------------------------
def generate_dataframe(symbol, today, lastday, train_data):
    # Parse the dd/mm/yyyy strings. Note that `today` is the start of the
    # date range and `lastday` is the end.
    from_date = pd.to_datetime(today, format='%d/%m/%Y').date()
    to_date = pd.to_datetime(lastday, format='%d/%m/%Y').date()

    df = pd.DataFrame(stock_df(symbol=symbol, from_date=from_date, to_date=to_date, series="EQ"))
    df = df[["DATE", "CLOSE", "HIGH", "LOW", "PREV. CLOSE", "VWAP", "NO OF TRADES", "OPEN"]]
    df = df.iloc[::-1]  # reverse the row order so the oldest date comes first
    df['DATE'] = pd.to_datetime(df['DATE'], format='%d-%m-%Y')
    df['DATE'] = df['DATE'].dt.strftime('%d/%m/%Y')
    # The "Stocks" directory must already exist.
    if train_data == "0":
        df.to_csv("Stocks/" + symbol + ".csv", index=False)
    else:
        df.to_csv("Stocks/" + symbol + "_train.csv", index=False)

# ----------------------------------------------- pickle --------------------------------------------------------
def write_pickle(DATA, symbol):
    pd.to_pickle(DATA, symbol + ".pkl")

# -------------------------------------------------------- MAIN -----------------------------------------------------------------------
def main():
    arguments = ["SBIN", "ADANIENT"]
    lastday = "01/01/2024"  # end of the date range
    today = "05/12/2022"    # start of the date range
    train_data = "0"
    for argument in arguments:
        generate_dataframe(argument, today, lastday, train_data)
    #write_pickle(DATA, "Stocks/" + argument)
    
    df1 = pd.read_csv('Stocks/SBIN.csv')
    df2 = pd.read_csv('Stocks/ADANIENT.csv')
    pairs(df1, df2)

if __name__ == "__main__":
    main()

          DATE   CLOSE    HIGH     LOW  PREV. CLOSE    VWAP  NO OF TRADES  \
0   05/12/2022  617.30  618.00  607.55       607.55  614.05        210728   
1   06/12/2022  608.95  619.80  607.80       617.30  613.29        190699   
2   07/12/2022  607.05  612.90  604.50       608.95  607.96        113416   
3   08/12/2022  611.65  613.80  607.15       607.05  611.14        146277   
4   09/12/2022  616.50  618.00  609.10       611.65  613.84        169991   
5   12/12/2022  613.05  618.70  611.00       616.50  613.74        135955   
6   13/12/2022  616.75  617.40  612.50       613.05  615.45        162670   
7   14/12/2022  625.50  626.75  617.50       616.75  623.52        187174   
8   15/12/2022  615.95  629.55  614.30       625.50  621.92        148853   
9   16/12/2022  603.35  615.60  602.10       615.95  607.83        160875   
10  19/12/2022  604.45  609.50  603.00       603.35  605.78        149442   
11  20/12/2022  604.45  606.50  599.55       604.45  602.65        128701   
