<a href="https://www.kaggle.com/code/ayushs9020/understanding-the-competition-asl?scriptVersionId=129180290" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Google American Sign Language

Google is back with another great competition: **`Google - American Sign Language Fingerspelling Recognition`**.

Let's first understand the competition in detail.

* **American Sign Language Fingerspelling** - American Sign Language (ASL) fingerspelling is a method used in sign language to spell out words letter by letter. It uses a specific handshape, sometimes with a movement, for each letter of the alphabet. Fingerspelling is an important tool for communication and is often used to convey names, places, or unfamiliar words in ASL. It takes practice and skill to execute accurately and fluently. And NO!!, the middle finger is not part of this alphabet.

<img src = "https://d.newsweek.com/en/full/1394686/asl-getty-images.jpg">

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/asl-fingerspelling/supplemental_metadata.csv
/kaggle/input/asl-fingerspelling/character_to_prediction_index.json
/kaggle/input/asl-fingerspelling/train.csv
/kaggle/input/asl-fingerspelling/supplemental_landmarks/371169664.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/369584223.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/1682915129.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/775880548.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/2100073719.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/1650637630.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/1471096258.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/86446671.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/897287709.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/333606065.parquet
/kaggle/input/asl-fingerspelling/supplemental_landmarks/2057261717.parquet
/kaggle/inpu
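
The listing above gets long very quickly. A small sketch of a more compact overview is to count the files per folder instead of printing every path. The helper name `count_files_by_folder` and the temporary demo folder are my own assumptions for illustration; on Kaggle you would point it at `/kaggle/input`.

```python
import os
import tempfile
from collections import Counter


def count_files_by_folder(root):
    """Return a Counter mapping each folder (relative to `root`) to its file count."""
    counts = Counter()
    for dirname, _, filenames in os.walk(root):
        if filenames:
            counts[os.path.relpath(dirname, root)] += len(filenames)
    return counts


# Demo on a throwaway folder that mimics the dataset layout (hypothetical files).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "supplemental_landmarks"))
    for name in ("a.parquet", "b.parquet"):
        open(os.path.join(root, "supplemental_landmarks", name), "w").close()
    open(os.path.join(root, "train.csv"), "w").close()
    demo_counts = count_files_by_folder(root)

for folder, n_files in sorted(demo_counts.items()):
    print(folder, n_files)

# On Kaggle: count_files_by_folder("/kaggle/input")
```

Files that sit directly in the root folder are reported under `.`.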

In [2]:
pip install --pre torcharrow -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

Looking in links: https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
Collecting torcharrow
  Downloading https://download.pytorch.org/whl/nightly/torcharrow-0.2.0a0.dev20230511-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
Collecting typing
  Downloading typing-3.7.4.3.tar.gz (78 kB)
  Preparing metadata (setup.py) ... done
Collecting numpy==1.21.4
  Downloading numpy-1.21.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.9 MB)
Collecting pandas<=1.3.5
  Downloading pandas-1.3.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 

In [3]:
import pandas as pd 
import torchdata
import torcharrow

# 1 | Train 

This is our `train data`. It contains everything related to the `training` and `target` values. I think `file_id`, `sequence_id`, and `participant_id` are not that important as model inputs, so we could simply drop them for a first look. One caveat: `sequence_id` is probably what links each phrase to its rows inside the parquet file, so it is worth keeping around for loading the landmarks.

In [4]:
train = pd.read_csv("/kaggle/input/asl-fingerspelling/train.csv")
train

| | path | file_id | sequence_id | participant_id | phrase |
|---|---|---|---|---|---|
| 0 | train_landmarks/5414471.parquet | 5414471 | 1816796431 | 217 | 3 creekhouse |
| 1 | train_landmarks/5414471.parquet | 5414471 | 1816825349 | 107 | scales/kuhaylah |
| 2 | train_landmarks/5414471.parquet | 5414471 | 1816862427 | 0 | hentaihubs.com |
| 3 | train_landmarks/5414471.parquet | 5414471 | 1816909464 | 1 | 1383 william lanier |
| 4 | train_landmarks/5414471.parquet | 5414471 | 1816967051 | 63 | 988 franklin lane |
| ... | ... | ... | ... | ... | ... |
| 67282 | train_landmarks/2118949241.parquet | 2118949241 | 388192924 | 88 | 431-366-2913 |
| 67283 | train_landmarks/2118949241.parquet | 2118949241 | 388225542 | 154 | 994-392-3850 |
| 67284 | train_landmarks/2118949241.parquet | 2118949241 | 388232076 | 95 | https://www.tianjiagenomes.com |
| 67285 | train_landmarks/2118949241.parquet | 2118949241 | 388235284 | 36 | 90 kerwood circle |


The `path` column shows where the `parquet file` is located. You could also use `file_id` to locate the corresponding `parquet file`: if we strip the `train_landmarks/` prefix and the `.parquet` extension from `path`, we get the same value.

We use `path` instead of `file_id` because a file path is easier to work with when we go looking for the data.
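
A quick sketch to check that claim: take the file name out of `path` and compare it with `file_id`. The helper `file_id_from_path` is a name I made up for illustration, and the two rows are copied from the table above.

```python
from pathlib import PurePosixPath


def file_id_from_path(path):
    """Extract the numeric id from a path like 'train_landmarks/5414471.parquet'."""
    return int(PurePosixPath(path).stem)


# (path, file_id) pairs taken from the first and last rows shown above.
rows = [
    ("train_landmarks/5414471.parquet", 5414471),
    ("train_landmarks/2118949241.parquet", 2118949241),
]

all_match = all(file_id_from_path(path) == file_id for path, file_id in rows)
print(all_match)
```

On the full table the same check could be written as `(train["path"].map(file_id_from_path) == train["file_id"]).all()`.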

The other column, `phrase`, is the target column that we will train the model to predict.
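
Since the target is text, it has to be turned into numbers before training. Here is a minimal sketch of that step. The character table below is a hypothetical stand-in that I made up; the real one presumably comes from `character_to_prediction_index.json` (listed in the input files), and I am assuming it is a flat character-to-integer mapping.

```python
# Hypothetical character table, NOT the real competition mapping.
# Assumed way to load the real one:
#   import json
#   with open("/kaggle/input/asl-fingerspelling/character_to_prediction_index.json") as f:
#       char_to_index = json.load(f)
alphabet = " 0123456789abcdefghijklmnopqrstuvwxyz-./:"
char_to_index = {ch: i for i, ch in enumerate(alphabet)}
index_to_char = {i: ch for ch, i in char_to_index.items()}


def encode_phrase(phrase, mapping):
    """Turn a phrase into a list of integer labels, one per character."""
    return [mapping[ch] for ch in phrase]


def decode_indices(indices, inverse_mapping):
    """Turn a list of integer labels back into a phrase."""
    return "".join(inverse_mapping[i] for i in indices)


encoded = encode_phrase("3 creekhouse", char_to_index)
decoded = decode_indices(encoded, index_to_char)
print(encoded)
print(decoded)
```

A character that is missing from the table raises a `KeyError`, which is a handy early warning if the table and the phrases do not match.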

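# 2 | Supplemental Metadata

The supplemental metadata has the same columns. The visible difference is the phrases, which are full sentences here rather than addresses, URLs, and phone numbers. We will use this data to test our model.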

In [5]:
supplemental_metdata = pd.read_csv("/kaggle/input/asl-fingerspelling/supplemental_metadata.csv")
supplemental_metdata

| | path | file_id | sequence_id | participant_id | phrase |
|---|---|---|---|---|---|
| 0 | supplemental_landmarks/33432165.parquet | 33432165 | 1535467051 | 251 | coming up with killer sound bites |
| 1 | supplemental_landmarks/33432165.parquet | 33432165 | 1535499058 | 239 | we better investigate this |
| 2 | supplemental_landmarks/33432165.parquet | 33432165 | 1535530550 | 245 | interesting observation was made |
| 3 | supplemental_landmarks/33432165.parquet | 33432165 | 1535545499 | 38 | victims deserve more redress |
| 4 | supplemental_landmarks/33432165.parquet | 33432165 | 1535585216 | 254 | knee bone is connected to the thigh bone |
| ... | ... | ... | ... | ... | ... |
| 52953 | supplemental_landmarks/2100073719.parquet | 2100073719 | 1090866442 | 239 | want to join us for lunch |
| 52954 | supplemental_landmarks/2100073719.parquet | 2100073719 | 1090966452 | 95 | this phenomenon will never occur |
| 52955 | supplemental_landmarks/2100073719.parquet | 2100073719 | 1091005846 | 40 | the winner of the race |
| 52956 | supplemental_landmarks/2100073719.parquet | 2100073719 | 1091011550 | 241 | are you sure you want this |


In [6]:
supplemental_metdata.drop(["file_id" , "sequence_id" , "participant_id"] , axis = 1 , inplace = True)
supplemental_metdata

| | path | phrase |
|---|---|---|
| 0 | supplemental_landmarks/33432165.parquet | coming up with killer sound bites |
| 1 | supplemental_landmarks/33432165.parquet | we better investigate this |
| 2 | supplemental_landmarks/33432165.parquet | interesting observation was made |
| 3 | supplemental_landmarks/33432165.parquet | victims deserve more redress |
| 4 | supplemental_landmarks/33432165.parquet | knee bone is connected to the thigh bone |
| ... | ... | ... |
| 52953 | supplemental_landmarks/2100073719.parquet | want to join us for lunch |
| 52954 | supplemental_landmarks/2100073719.parquet | this phenomenon will never occur |
| 52955 | supplemental_landmarks/2100073719.parquet | the winner of the race |
| 52956 | supplemental_landmarks/2100073719.parquet | are you sure you want this |


# 3 | Parquet Files
Parquet files are a little more complicated to work with than CSV files, but both `pandas` and `tensorflow` offer ways to read them. In `pandas` this is `pd.read_parquet`.

`PyTorch` has a specialised loader for **reading parquet data from pipelines/paths**: **[torchdata.datapipes.iter.ParquetDataFrameLoader](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.ParquetDataFrameLoader.html)**. You need to install `torcharrow` to use it. `torcharrow` does not have a stable release yet, but you can still install the nightly build:
```
! pip install --pre torcharrow -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
import torcharrow
```
Even after installing it, I still get an error. I don't know whether the problem is specific to my setup or not. If you find any leads, please tell me :)

Here is the **[Github](https://github.com/pytorch/torcharrow)** for `torcharrow`

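Here is the code for loading the data. The paths in `train.csv` are relative, so the dataset folder is prepended first. I could not get this to run myself, so treat it as a sketch:
```
# Build absolute, de-duplicated file paths, then wrap them in a datapipe for the loader.
paths = ("/kaggle/input/asl-fingerspelling/" + train["path"]).unique()
torchdata.datapipes.iter.ParquetDataFrameLoader(torchdata.datapipes.iter.IterableWrapper(paths))
```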

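Now let's see how the data is given to us.

For some reason the `sample_data` output does not show up in the published version of the notebook, although it can be seen in `edit` mode. In the published run the cell below fails with an `ImportError`. A likely cause (a guess based on the pip log above) is that installing `torcharrow` pulled in `pandas<=1.3.5` and `numpy==1.21.4`, which no longer match the preinstalled `pyarrow`.

It has $173,385$ rows and $1,630$ columns.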

In [7]:
sample_data = pd.read_parquet("/kaggle/input/asl-fingerspelling/supplemental_landmarks/1032110484.parquet")
sample_data

Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3505, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_20/658195339.py", line 1, in <module>
    sample_data = pd.read_parquet("/kaggle/input/asl-fingerspelling/supplemental_landmarks/1032110484.parquet")
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parquet.py", line 501, in read_parquet
    )
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/parquet.py", line 52, in get_engine
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - No module named 'pandas.core.arrays.arrow.extension_types'
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

During handling of the ab
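
The traceback says that `pandas` could not find a usable parquet engine. A small first check, sketched below, is to see which of the two engine packages named in the error are installed at all. The helper name `available_parquet_engines` is my own. Note that this only tells you whether a package is present, not whether its version is compatible with the installed `pandas`, which seems to be the real problem here.

```python
import importlib.util


def available_parquet_engines(candidates=("pyarrow", "fastparquet")):
    """Return the candidate engine packages that can be found in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
print(engines or "no parquet engine found")
```

If a package is listed but `pd.read_parquet` still fails, restarting the kernel after the `pip install`, or reading the parquet files before installing `torcharrow`, may be worth a try.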

# 4 | Advisory

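The source of this data is `video` and the `output` is `language` (text). As the folder names suggest, the files we actually get contain landmarks extracted from the video rather than raw frames. One lead (I think) is a recurrent convolutional neural network, i.e. convolutional layers followed by recurrent ones. This is usually abbreviated CRNN, since R-CNN normally refers to region-based CNNs for object detection. These models tend to work well on sequential data such as video.

Also, we need to keep the training time as low as possible, as the competition asks.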

**THAT'S IT FOR TODAY, GUYS**

**WE WILL GO DEEPER INTO THE DATA IN THE UPCOMING VERSIONS**

**PLEASE COMMENT YOUR THOUGHTS, HIGHLY APPRECIATED**

**DON'T FORGET TO UPVOTE IF YOU LIKED MY WORK**

<img src = "https://i.imgflip.com/19aadg.jpg">

**PEACE OUT!!!**
